1. From “Ghost References” to a Systemic Integrity Crisis
Ghost references are citations to works that do not exist. They differ from:
- Citation unfaithfulness: real papers cited for unsupported claims.
- “Zombie” papers: low-quality or outdated works that really exist. [1][11]
Humans have long produced bogus references via typos, copying, or paper mills. LLMs change the scale: one prompt can yield dozens of plausible but nonexistent citations that flow into papers, theses, and reports with minimal friction. [1][3]
📊 By the numbers
- Across 13 state-of-the-art LLMs tested on citation generation in 40 domains, hallucination rates ranged from 14.23% to 94.93%. [11]
- Other studies report 18–69% fabricated references, including one medical study where 47% of ChatGPT references were made up and only 7% were both real and accurate. [2]
GhostCite’s analysis of 2.2M citations in 56,381 AI/ML and security papers (2020–2025) found: [11]
- 1.07% of papers (604) contained invalid or fabricated citations.
- An 80.9% jump in such papers in 2025 alone.
At NeurIPS 2025, an analysis of over 4,000 accepted papers found more than 100 hallucinated citations spread across at least 53 papers, including fully made-up references and subtly altered real ones. [5]
⚠️ Section takeaway
Ghost references are a growing integrity problem, not a curiosity. The rest of this article explains why LLMs structurally invent citations, how academic workflows entrench them, and what a realistic mitigation stack requires.
This article was generated by CoreProse in 4m 1s with 10 verified sources.
2. Why Language Models Are Structurally Prone to Fabricate Citations
LLMs are trained to predict the next token, not to verify facts. When asked for references, the objective is:
- “Produce text that looks like a citation,”
- Not “guarantee this bibliographic object exists.” [2]
They internalize citation patterns from training data:
- Common surname sequences and author formats.
- Journal and conference names, volume/issue patterns, year ranges.
- DOI-like strings and arXiv-like IDs.
So when a model outputs:
“Smith, J. and Doe, A. (2019). Robust Reinforcement Learning in Safety-Critical Systems. Journal of Applied Machine Learning, 12(3), 45–60. https://doi.org/10.1145/1234567.8901234”
it is sampling a plausible pattern, not retrieving a record. Existence is not part of the objective. [2]
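The gap between syntactic plausibility and existence is easy to demonstrate: the fabricated DOI above passes a standard shape check. A minimal sketch in Python, using a simplified version of Crossref's recommended pattern for modern DOIs (it validates form only and says nothing about whether the DOI is registered):

```python
import re

# Simplified form of Crossref's suggested DOI pattern: a "10." prefix,
# a 4-9 digit registrant code, a slash, then a non-whitespace suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$", re.IGNORECASE)

def looks_like_doi(s: str) -> bool:
    """True if the string is shaped like a DOI -- existence is not checked."""
    return bool(DOI_PATTERN.match(s))

# The fabricated DOI from the example citation is syntactically valid:
print(looks_like_doi("10.1145/1234567.8901234"))  # True
print(looks_like_doi("not-a-doi"))                # False
```

The point of the sketch: any pattern a validator can describe, a pattern-completing model can also emit, so format checks alone cannot distinguish real citations from fabricated ones.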
GhostCite’s benchmark shows this is universal: all 13 evaluated models hallucinated citations, with higher failure rates in niche or fast-moving domains and lower rates in well-covered areas. [11] This is a structural consequence of pattern-based modeling plus uneven knowledge coverage, not a single-vendor bug.
Citation hallucination is one facet of a broader fabrication tendency:
- Fake URLs that follow familiar structures but do not resolve.
- Invented tool or API outputs formatted like real responses.
- Metadata (authors, DOIs, IDs) that looks right but fails existence checks. [2]
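Those existence checks can be automated. A hedged sketch: query the public Crossref REST API (`https://api.crossref.org/works/<doi>`), which answers 404 for DOIs that are not registered. The HTTP call is injected as a parameter so the logic can be demonstrated offline with a stub:

```python
import urllib.request
import urllib.error

def http_status(url: str) -> int:
    """Default fetcher: return the HTTP status code of a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

def doi_exists(doi: str, fetch=http_status) -> bool:
    """A DOI 'exists' if Crossref's works endpoint answers 200 for it."""
    return fetch(f"https://api.crossref.org/works/{doi}") == 200

# Offline demonstration with a stub standing in for Crossref:
registered = {"https://api.crossref.org/works/10.1038/nature14539"}
fake_fetch = lambda url: 200 if url in registered else 404
print(doi_exists("10.1038/nature14539", fetch=fake_fetch))      # True
print(doi_exists("10.1145/1234567.8901234", fetch=fake_fetch))  # False
```

A production checker would also consult DataCite and publisher resolvers, since Crossref covers only part of the DOI space.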
RAG systems try to ground outputs in retrieved documents, but models still often: [1][5]
- Mix real citations with pattern-generated ones.
- Slightly paraphrase titles so they no longer match catalogs.
- Add/drop coauthors or change venues while keeping structure intact.
These small mutations preserve plausibility while breaking verifiability; humans may not notice, but resolvers and databases will.
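Catching these near-miss mutations is what fuzzy matching is for. A minimal sketch using only Python's standard library; the example titles and the 0.8 threshold are illustrative choices, and real resolvers apply heavier normalization:

```python
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so formatting noise doesn't count."""
    return " ".join(title.lower().split())

def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

catalog_title = "Robust Reinforcement Learning in Safety-Critical Systems"
mutated_title = "Robust Reinforcement Learning for Safety-Critical Control"

score = title_similarity(catalog_title, mutated_title)
print(score == 1.0)  # False: an exact-match catalog lookup would miss it
print(score > 0.8)   # True: fuzzy matching still flags it as a near-duplicate
```

This is why the subtly altered titles seen at NeurIPS evade naive lookups: they break exact matching while staying well inside the similarity band a verifier can detect.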
⚠️ Section takeaway
Given how LLMs are trained, pattern-consistent but nonexistent citations are expected. Any solution must explicitly counter this bias; prompts like “be accurate” are insufficient.
3. How Academic Workflows Turn Hallucinations into “Real” Literature
The deeper risk is how quickly ghost citations acquire the appearance of real scholarship once humans reuse them.
One AI-hallucinated article on “education governance and datafication” accumulated over 40 Google Scholar citations despite never existing as described. [3] To a casual reader, it looked like a legitimate, cited paper.
This pattern is spreading:
- Students’ bogus references often trace back to published papers already containing AI-generated ghost citations. [4]
- Seeing multiple papers cite the same nonexistent work, readers infer legitimacy.
- LLMs then train on these polluted bibliographies and reproduce them. [1][4]
📊 Peer review’s blind spot
The NeurIPS 2025 analysis showed: [5]
- At least 53 accepted papers contained hallucinated or corrupted citations.
- Each had multiple reviewers, yet references were not fully checked.
The HalluCitation study in NLP found 300 ACL-related papers with: [12]
- Wrong arXiv IDs.
- Links to unrelated content.
- Plausible-looking but unresolvable citations.
GhostCite’s survey of 94 researchers revealed weak verification norms: [11]
- 41.5% copy-paste BibTeX without checking.
- 44.4% do nothing when encountering suspicious references.
- 76.7% of reviewers do not thoroughly check references.
- 80.0% never suspect fake citations in reviewed papers.
💼 Structural incentives
- Careers depend on publication and citation counts.
- Conferences face submission surges and reviewer shortages. [12]
- Deadlines and penalties favor speed over depth. [12]
Combined with unverified LLM outputs, these pressures let fabricated citations pass through drafting, review, indexing, and then be re-cited as if solid prior work. [3][11]
⚠️ Section takeaway
The academic ecosystem amplifies LLM hallucinations. Weak verification habits, overloaded review, and metric-driven incentives allow fake references to harden into the scholarly record.
4. Why RAG, Fact-Checking, and Citation Tools Help—but Do Not Solve It
RAG is often marketed as the cure for hallucinations: retrieve from a trusted index, then answer based on those documents. This reduces fabrication pressure but does not eliminate it. [1]
GhostCite’s CiteVerifier shows what targeted tooling can do by automatically checking whether a citation: [11]
- Resolves in major indexes or catalogs.
- Has matching metadata (title, authors, venue, year).
- Is consistent across mentions.
Using this, the authors audited millions of citations and flagged hundreds of papers with invalid or fabricated references, demonstrating that large-scale automated validation is feasible.
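The consistency check in particular is cheap to implement: group every mention by its resolvable key (here, the DOI) and flag keys that appear with conflicting metadata. A minimal sketch in that spirit, not the authors' actual CiteVerifier code, with illustrative DOIs:

```python
from collections import defaultdict

def inconsistent_dois(citations):
    """Return DOIs that appear with more than one distinct title.

    `citations` is a list of (doi, title) pairs gathered from one or more
    bibliographies; titles are normalized before comparison.
    """
    titles_by_doi = defaultdict(set)
    for doi, title in citations:
        titles_by_doi[doi].add(" ".join(title.lower().split()))
    return sorted(doi for doi, titles in titles_by_doi.items() if len(titles) > 1)

mentions = [
    ("10.1000/xyz123", "Deep Learning for Citation Analysis"),
    ("10.1000/xyz123", "Deep Learning for Citation Analysis"),
    # Same DOI, subtly different title -- a classic mutation signature:
    ("10.1000/xyz123", "Deep Learning in Citation Analysis"),
    ("10.1000/abc999", "Graph Methods for Bibliometrics"),
]
print(inconsistent_dois(mentions))  # ['10.1000/xyz123']
```

Because it needs no external lookups, a check like this can run at corpus scale before the slower resolve-and-match passes.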
Yet fact-checking research highlights limits to naive retrieval. Yao et al. find that LLMs: [6][7]
- Better assess national/international news than local or rapidly changing events.
- Handle static facts better than dynamic ones.
- Improve with search results—but also increase incorrect assessments when retrieval returns irrelevant or low-quality pages.
When the web corpus itself is polluted (e.g., by ghost citations), retrieval can amplify confident errors. [1][6]
💡 Uneven gains, uneven risk
- Slow-moving, well-indexed fields benefit most from automated checks.
- Fast-moving, niche, or underrepresented domains remain fragile. [6][7]
- Fields with strong preprint cultures risk subtle ID/title mutations that evade simple matching. [12]
A realistic mitigation stack is layered:
- RAG for drafting, constrained to curated, high-quality corpora where possible.
- Independent citation-verification passes (e.g., CiteVerifier) integrated into writing tools and submission systems. [11]
- Risk-based deployment: restrict or ban automated citation generation in high-stakes domains like clinical guidance. [2]
- Human-in-the-loop checks focused on references in dynamic or underindexed areas. [6][7]
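The routing logic of such a layered stack can be sketched as a simple policy function. The domain categories and decisions below are illustrative assumptions, not prescriptions from the cited work:

```python
def route_citation(domain: str, auto_verified: bool) -> str:
    """Decide how a generated citation moves through the mitigation stack."""
    HIGH_STAKES = {"clinical", "legal"}        # assumed: generation restricted outright
    UNDERINDEXED = {"niche", "fast-moving"}    # assumed: automated checks are fragile here
    if domain in HIGH_STAKES:
        return "block automated citation generation"
    if not auto_verified:
        return "reject: citation failed verification"
    if domain in UNDERINDEXED:
        return "queue for human reference check"
    return "accept"

print(route_citation("clinical", True))   # block automated citation generation
print(route_citation("general", False))   # reject: citation failed verification
print(route_citation("niche", True))      # queue for human reference check
print(route_citation("general", True))    # accept
```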
⚠️ Section takeaway
RAG and validation tools are necessary but insufficient. They must be combined with curated corpora, explicit verification stages, and domain-aware human review.
5. The Confidence Trap: Why Models Sound Sure When They Are Wrong
Many users try to manage hallucinations by asking models how “confident” they are in a citation list. This misunderstands what LLMs can report.
When a model says “I am 95% confident,” it is not exposing a calibrated internal probability. It is generating a number that sounds right, based on patterns in training data. [9]
Biases pushing toward overconfidence include:
- Training data dominated by assertive expert prose.
- Instruction tuning and RL that reward decisive, helpful-sounding answers.
- Learned tendencies to output 0.9+ when asked for probabilities, even when guessing. [9]
📊 Workflow implications
In e-discovery, LLM-based relevance scoring has shown: [8]
- Scores can vary between runs on identical documents.
- Rankings and metrics are unstable compared to traditional TAR models.
For citations, this means: [2][8][9]
- A model may assign high “confidence” to a fabricated but pattern-consistent article.
- Users may treat numeric confidence as justification to skip external checks.
- Repeated queries about the same citation can yield different confidence values.
⚡ Evidence-based confidence instead
Researchers advocate deriving confidence from external evidence, not self-report: [6][9]
- Exact or high-overlap matches between citation metadata and retrieved records.
- Cross-source consistency of titles, authors, venues, and years.
- Alignment between cited claims and the content of the referenced work.
A citation should be “high confidence” only when retrieval and matching confirm its existence and relevance—not when the model outputs “0.99.” [6][9]
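Evidence-based confidence can be computed rather than asked for. A hedged sketch that combines the three signals above into one score; the weights (title 0.5, authors 0.3, year 0.2) are arbitrary illustrations, not values from the cited studies:

```python
from difflib import SequenceMatcher
from typing import Optional

def evidence_confidence(claimed: dict, retrieved: Optional[dict]) -> float:
    """Score a citation from external evidence, never from model self-report.

    `claimed` is the metadata the model emitted; `retrieved` is the best
    matching catalog record, or None if nothing resolved.
    """
    if retrieved is None:
        return 0.0  # unresolvable citations score zero, whatever the model "feels"
    title_sim = SequenceMatcher(
        None, claimed["title"].lower(), retrieved["title"].lower()
    ).ratio()
    claimed_authors = {a.lower() for a in claimed["authors"]}
    retrieved_authors = {a.lower() for a in retrieved["authors"]}
    author_overlap = len(claimed_authors & retrieved_authors) / len(
        claimed_authors | retrieved_authors
    )
    year_match = 1.0 if claimed["year"] == retrieved["year"] else 0.0
    return 0.5 * title_sim + 0.3 * author_overlap + 0.2 * year_match

record = {"title": "Attention Is All You Need",
          "authors": ["Vaswani", "Shazeer"], "year": 2017}
print(evidence_confidence(record, None))    # 0.0 -- nothing resolved
print(evidence_confidence(record, record))  # 1.0 -- perfect catalog match
```

The design point is the `None` branch: a citation with no retrievable record gets zero confidence regardless of how assertively the model presented it.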
⚠️ Section takeaway
Self-reported LLM confidence is unreliable for citation integrity. Robust pipelines must separate generation from confidence estimation and base the latter on verifiable external signals.
6. Regulation, Incentives, and a Safer Future for Citations
Ghost citations threaten more than tidy bibliographies. Reliable references underpin trust in scientific claims, policy, and regulation. When citations are unreliable, the epistemic base of decision-making erodes. [4][11]
Work on foundation model regulation frames an “innovation trilemma”: regulators can at most optimize two of: [10]
- Promoting innovation.
- Mitigating systemic risk.
- Providing clear rules.
Innovation is treated as politically non-negotiable, leaving a trade-off between systemic risk mitigation and clarity. Citation integrity is clearly a systemic-risk issue: if models used for guidelines or policy briefs fabricate literature at nontrivial rates, any system relying on citation-based trust is exposed. [2][11]
💼 Policy levers for citation integrity
Possible governance measures include:
- Mandatory AI disclosure in manuscripts, specifying whether and how LLMs were used in drafting and reference generation.
- Automated citation validation at submission for major venues, flagging references that fail to resolve or whose metadata does not match catalog records.
- Citation audits in research assessment, where funders and tenure committees sample recent work for citation validity, not just counts. [11]
Conferences and journals should also adopt explicit AI and citation policies, including: [4][5][12]
- Standards for when AI-generated citations are allowed.
- Author attestations that all references have been verified.
- Sanctions or extra scrutiny for repeated submission of papers with hallucinated citations.
- Reviewer training on common LLM citation failure modes (subtle title shifts, implausible venue-year combos, generic placeholder authors).
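Some of those failure modes are mechanically detectable, which is what reviewer tooling could target. A toy sketch flagging two of them, an implausible venue-year combination and placeholder author names; the venue launch years and the name list are illustrative stand-ins, not data from the cited sources:

```python
# Illustrative venue launch years; a real checker would query a venue database.
VENUE_FIRST_YEAR = {"NeurIPS": 1987, "ICLR": 2013, "ACL": 1962}
PLACEHOLDER_AUTHORS = {"smith, j.", "doe, a.", "doe, j."}

def citation_red_flags(venue: str, year: int, authors: list) -> list:
    """Return human-readable warnings for a single reference entry."""
    flags = []
    first = VENUE_FIRST_YEAR.get(venue)
    if first is not None and year < first:
        flags.append(f"{venue} did not exist in {year}")
    if any(a.lower() in PLACEHOLDER_AUTHORS for a in authors):
        flags.append("placeholder-looking author name")
    return flags

print(citation_red_flags("ICLR", 2009, ["Smith, J."]))
# ['ICLR did not exist in 2009', 'placeholder-looking author name']
print(citation_red_flags("ACL", 2020, ["Vaswani, A."]))  # []
```

Heuristics like these cannot prove fabrication, but surfacing them during review would focus scarce reviewer attention on the references most likely to be ghosts.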
💡 Aligning tools, workflows, and rules
Without incentives that reward verification, even excellent tools will be underused. Editorial and funding structures must value citation integrity alongside novelty and volume. [1][2][11]
⚠️ Section takeaway
Citation integrity should be treated as core research infrastructure, like data management and ethics review. Aligning regulation, incentives, and tooling is essential to contain LLM-driven fabrication.
Conclusion
LLMs invent citations because they are pattern completers, not fact retrievers. In an overstretched scholarly ecosystem with weak reference-checking norms, these fabrications quickly acquire the appearance of real literature. [1][2][11]
Mitigation requires:
- Grounded generation on curated corpora.
- Automated, large-scale citation verification.
- Evidence-based confidence signals.
- Institutional reforms that treat citation validity as non-negotiable.
Researchers, editors, and tool builders should now:
- Audit how they use LLMs in writing and review.
- Pilot citation-verification pipelines.
- Push venues and funders to adopt explicit AI and citation-integrity policies.
Acting before ghost references become normalized is crucial to preserving trust in the scientific record.
Sources & References (10)
1. Why Ghost References Still Haunt Us in 2025—And Why It's Not Just About LLMs
2. The Fabrication Problem: How AI Models Generate Fake Citations, URLs, and References (Nayeem Islam, Jun 12, 2025)
3. LLMs Generate Fake Citations in Academic Papers
4. AI Chatbots Are Poisoning Research Archives With Fake Citations (Miles Klee, December 17, 2025)
5. NeurIPS research papers contained 100+ AI-hallucinated citations, new report claims (Sharon Goldman, January 21, 2026)
6. Fact-checking AI-generated news reports: Can LLMs catch their own lies? (Jiayi Yao, Haibo Sun, Nianwen Xue, 24 Mar 2025)
7. Fact-checking AI-generated news reports: Can LLMs catch their own lies? (abstract page for the same paper)
8. Why Confidence Scoring With LLMs Is Dangerous
9. Rethinking Confidence in LLM Data Extraction
10. Regulation Priorities for Artificial Intelligence Foundation Models (Matthew R. Gaske)