1. From “Ghost References” to a Systemic Integrity Crisis
Ghost references are citations to works that do not exist. They differ from:
- Citation unfaithfulness: real papers cited for unsupported claims.
- “Zombie” papers: low-quality or outdated works that really exist. [1][11]
Humans have long produced bogus references via typos, copying, or paper mills. LLMs change the scale: one prompt can yield dozens of plausible but nonexistent citations that flow into papers, theses, and reports with minimal friction. [1][3]
📊 By the numbers
- Across 13 state-of-the-art LLMs tested on citation generation in 40 domains, hallucination rates ranged from 14.23% to 94.93%. [11]
- Other studies report 18–69% fabricated references, including one medical study where 47% of ChatGPT references were made up and only 7% were both real and accurate. [2]
GhostCite’s analysis of 2.2M citations in 56,381 AI/ML and security papers (2020–2025) found: [11]
- 1.07% of papers (604) contained invalid or fabricated citations.
- An 80.9% jump in such papers in 2025 alone.
At NeurIPS 2025, an analysis of over 4,000 accepted papers found more than 100 hallucinated citations spread across at least 53 papers, including fully made-up references and subtly altered real ones. [5]
⚠️ Section takeaway
Ghost references are a growing integrity problem, not a curiosity. The rest of this article explains why LLMs structurally invent citations, how academic workflows entrench them, and what a realistic mitigation stack requires.
This article was generated by CoreProse in 4m 1s with 10 verified sources.
2. Why Language Models Are Structurally Prone to Fabricate Citations
LLMs are trained to predict the next token, not to verify facts. When asked for references, the objective is:
- “Produce text that looks like a citation,”
- Not “guarantee this bibliographic object exists.” [2]
They internalize citation patterns from training data:
- Common surname sequences and author formats.
- Journal and conference names, volume/issue patterns, year ranges.
- DOI-like strings and arXiv-like IDs.
So when a model outputs:
“Smith, J. and Doe, A. (2019). Robust Reinforcement Learning in Safety-Critical Systems. Journal of Applied Machine Learning, 12(3), 45–60. https://doi.org/10.1145/1234567.8901234”
it is sampling a plausible pattern, not retrieving a record. Existence is not part of the objective. [2]
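The gap between syntactic plausibility and existence is easy to demonstrate: the fabricated DOI above passes a standard shape check. A minimal sketch in Python, using a simplified version of Crossref's recommended pattern for modern DOIs (it validates form only and says nothing about whether the DOI is registered):

```python
import re

# Simplified form of Crossref's suggested DOI pattern: a "10." prefix,
# a 4-9 digit registrant code, a slash, then a non-whitespace suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$", re.IGNORECASE)

def looks_like_doi(s: str) -> bool:
    """True if the string is shaped like a DOI -- existence is not checked."""
    return bool(DOI_PATTERN.match(s))

# The fabricated DOI from the example citation is syntactically valid:
print(looks_like_doi("10.1145/1234567.8901234"))  # True
print(looks_like_doi("not-a-doi"))                # False
```

The point of the sketch: any pattern a validator can describe, a pattern-completing model can also emit, so format checks alone cannot distinguish real citations from fabricated ones.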
GhostCite’s benchmark shows this is universal: all 13 evaluated models hallucinated citations, with higher failure rates in niche or fast-moving domains and lower rates in well-covered areas. [11] This is a structural consequence of pattern-based modeling plus uneven knowledge coverage, not a single-vendor bug.
Citation hallucination is one facet of a broader fabrication tendency:
- Fake URLs that follow familiar structures but do not resolve.
- Invented tool or API outputs formatted like real responses.
- Metadata (authors, DOIs, IDs) that looks right but fails existence checks. [2]
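Those existence checks can be automated. A hedged sketch: query the public Crossref REST API (`https://api.crossref.org/works/<doi>`), which answers 404 for DOIs that are not registered. The HTTP call is injected as a parameter so the logic can be demonstrated offline with a stub:

```python
import urllib.request
import urllib.error

def http_status(url: str) -> int:
    """Default fetcher: return the HTTP status code of a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

def doi_exists(doi: str, fetch=http_status) -> bool:
    """A DOI 'exists' if Crossref's works endpoint answers 200 for it."""
    return fetch(f"https://api.crossref.org/works/{doi}") == 200

# Offline demonstration with a stub standing in for Crossref:
registered = {"https://api.crossref.org/works/10.1038/nature14539"}
fake_fetch = lambda url: 200 if url in registered else 404
print(doi_exists("10.1038/nature14539", fetch=fake_fetch))      # True
print(doi_exists("10.1145/1234567.8901234", fetch=fake_fetch))  # False
```

A production checker would also consult DataCite and publisher resolvers, since Crossref covers only part of the DOI space.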
RAG systems try to ground outputs in retrieved documents, but models still often: [1][5]
- Mix real citations with pattern-generated ones.
- Slightly paraphrase titles so they no longer match catalogs.
- Add/drop coauthors or change venues while keeping structure intact.
These small mutations preserve plausibility while breaking verifiability; humans may not notice, but resolvers and databases will.
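Catching these near-miss mutations is what fuzzy matching is for. A minimal sketch using only Python's standard library; the example titles and the 0.8 threshold are illustrative choices, and real resolvers apply heavier normalization:

```python
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so formatting noise doesn't count."""
    return " ".join(title.lower().split())

def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

catalog_title = "Robust Reinforcement Learning in Safety-Critical Systems"
mutated_title = "Robust Reinforcement Learning for Safety-Critical Control"

score = title_similarity(catalog_title, mutated_title)
print(score == 1.0)  # False: an exact-match catalog lookup would miss it
print(score > 0.8)   # True: fuzzy matching still flags it as a near-duplicate
```

This is why the subtly altered titles seen at NeurIPS evade naive lookups: they break exact matching while staying well inside the similarity band a verifier can detect.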
⚠️ Section takeaway
Given how LLMs are trained, pattern-consistent but nonexistent citations are expected. Any solution must explicitly counter this bias; prompts like “be accurate” are insufficient.
3. How Academic Workflows Turn Hallucinations into “Real” Literature
The deeper risk is how quickly ghost citations acquire the appearance of real scholarship once humans reuse them.
One AI-hallucinated article on “education governance and datafication” accumulated over 40 Google Scholar citations despite never existing as described. [3] To a casual reader, it looked like a legitimate, cited paper.
This pattern is spreading:
- Students’ bogus references often trace back to published papers already containing AI-generated ghost citations. [4]
- Seeing multiple papers cite the same nonexistent work, readers infer legitimacy.
- LLMs then train on these polluted bibliographies and reproduce them. [1][4]
📊 Peer review’s blind spot
The NeurIPS 2025 analysis showed: [5]
- At least 53 accepted papers contained hallucinated or corrupted citations.
- Each had multiple reviewers, yet references were not fully checked.
The HalluCitation study in NLP found 300 ACL-related papers with: [12]
- Wrong arXiv IDs.
- Links to unrelated content.
- Plausible-looking but unresolvable citations.
GhostCite’s survey of 94 researchers revealed weak verification norms: [11]
- 41.5% copy-paste BibTeX without checking.
- 44.4% do nothing when encountering suspicious references.
- 76.7% of reviewers do not thoroughly check references.
- 80.0% never suspect fake citations in reviewed papers.
💼 Structural incentives
- Careers depend on publication and citation counts.
- Conferences face submission surges and reviewer shortages. [12]
- Deadlines and penalties favor speed over depth. [12]
Combined with unverified LLM outputs, these pressures let fabricated citations pass through drafting, review, indexing, and then be re-cited as if solid prior work. [3][11]
⚠️ Section takeaway
The academic ecosystem amplifies LLM hallucinations. Weak verification habits, overloaded review, and metric-driven incentives allow fake references to harden into the scholarly record.
4. Why RAG, Fact-Checking, and Citation Tools Help—but Do Not Solve It
RAG is often marketed as the cure for hallucinations: retrieve from a trusted index, then answer based on those documents. This reduces fabrication pressure but does not eliminate it. [1]
GhostCite’s CiteVerifier shows what targeted tooling can do by automatically checking whether a citation: [11]
- Resolves in major indexes or catalogs.
- Has matching metadata (title, authors, venue, year).
- Is consistent across mentions.
Using this, the authors audited millions of citations and flagged hundreds of papers with invalid or fabricated references, demonstrating that large-scale automated validation is feasible.
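The consistency check in particular is cheap to implement: group every mention by its resolvable key (here, the DOI) and flag keys that appear with conflicting metadata. A minimal sketch in that spirit, not the authors' actual CiteVerifier code, with illustrative DOIs:

```python
from collections import defaultdict

def inconsistent_dois(citations):
    """Return DOIs that appear with more than one distinct title.

    `citations` is a list of (doi, title) pairs gathered from one or more
    bibliographies; titles are normalized before comparison.
    """
    titles_by_doi = defaultdict(set)
    for doi, title in citations:
        titles_by_doi[doi].add(" ".join(title.lower().split()))
    return sorted(doi for doi, titles in titles_by_doi.items() if len(titles) > 1)

mentions = [
    ("10.1000/xyz123", "Deep Learning for Citation Analysis"),
    ("10.1000/xyz123", "Deep Learning for Citation Analysis"),
    # Same DOI, subtly different title -- a classic mutation signature:
    ("10.1000/xyz123", "Deep Learning in Citation Analysis"),
    ("10.1000/abc999", "Graph Methods for Bibliometrics"),
]
print(inconsistent_dois(mentions))  # ['10.1000/xyz123']
```

Because it needs no external lookups, a check like this can run at corpus scale before the slower resolve-and-match passes.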
Yet fact-checking research highlights limits to naive retrieval. Yao et al. find that LLMs: [6][7]
- Better assess national/international news than local or rapidly changing events.
- Handle static facts better than dynamic ones.
- Improve with search results—but also increase incorrect assessments when retrieval returns irrelevant or low-quality pages.
When the web corpus itself is polluted (e.g., by ghost citations), retrieval can amplify confident errors. [1][6]
💡 Uneven gains, uneven risk
- Slow-moving, well-indexed fields benefit most from automated checks.
- Fast-moving, niche, or underrepresented domains remain fragile. [6][7]
- Fields with strong preprint cultures risk subtle ID/title mutations that evade simple matching. [12]
A realistic mitigation stack is layered:
- RAG for drafting, constrained to curated, high-quality corpora where possible.
- Independent citation-verification passes (e.g., CiteVerifier) integrated into writing tools and submission systems. [11]
- Risk-based deployment: restrict or ban automated citation generation in high-stakes domains like clinical guidance. [2]
- Human-in-the-loop checks focused on references in dynamic or underindexed areas. [6][7]
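The routing logic of such a layered stack can be sketched as a simple policy function. The domain categories and decisions below are illustrative assumptions, not prescriptions from the cited work:

```python
def route_citation(domain: str, auto_verified: bool) -> str:
    """Decide how a generated citation moves through the mitigation stack."""
    HIGH_STAKES = {"clinical", "legal"}        # assumed: generation restricted outright
    UNDERINDEXED = {"niche", "fast-moving"}    # assumed: automated checks are fragile here
    if domain in HIGH_STAKES:
        return "block automated citation generation"
    if not auto_verified:
        return "reject: citation failed verification"
    if domain in UNDERINDEXED:
        return "queue for human reference check"
    return "accept"

print(route_citation("clinical", True))   # block automated citation generation
print(route_citation("general", False))   # reject: citation failed verification
print(route_citation("niche", True))      # queue for human reference check
print(route_citation("general", True))    # accept
```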
⚠️ Section takeaway
RAG and validation tools are necessary but insufficient. They must be combined with curated corpora, explicit verification stages, and domain-aware human review.
5. The Confidence Trap: Why Models Sound Sure When They Are Wrong
Many users try to manage hallucinations by asking models how “confident” they are in a citation list. This misunderstands what LLMs can report.
When a model says “I am 95% confident,” it is not exposing a calibrated internal probability. It is generating a number that sounds right, based on patterns in training data. [9]
Biases pushing toward overconfidence include:
- Training data dominated by assertive expert prose.
- Instruction tuning and RL that reward decisive, helpful-sounding answers.
- Learned tendencies to output 0.9+ when asked for probabilities, even when guessing. [9]
📊 Workflow implications
In e-discovery, LLM-based relevance scoring has shown: [8]
- Scores can vary between runs on identical documents.
- Rankings and metrics are unstable compared to traditional TAR models.
For citations, this means: [2][8][9]
- A model may assign high “confidence” to a fabricated but pattern-consistent article.
- Users may treat numeric confidence as justification to skip external checks.
- Repeated queries about the same citation can yield different confidence values.
⚡ Evidence-based confidence instead
Researchers advocate deriving confidence from external evidence, not self-report: [6][9]
- Exact or high-overlap matches between citation metadata and retrieved records.
- Cross-source consistency of titles, authors, venues, and years.
- Alignment between cited claims and the content of the referenced work.
A citation should be “high confidence” only when retrieval and matching confirm its existence and relevance—not when the model outputs “0.99.” [6][9]
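Evidence-based confidence can be computed rather than asked for. A hedged sketch that combines the three signals above into one score; the weights (title 0.5, authors 0.3, year 0.2) are arbitrary illustrations, not values from the cited studies:

```python
from difflib import SequenceMatcher
from typing import Optional

def evidence_confidence(claimed: dict, retrieved: Optional[dict]) -> float:
    """Score a citation from external evidence, never from model self-report.

    `claimed` is the metadata the model emitted; `retrieved` is the best
    matching catalog record, or None if nothing resolved.
    """
    if retrieved is None:
        return 0.0  # unresolvable citations score zero, whatever the model "feels"
    title_sim = SequenceMatcher(
        None, claimed["title"].lower(), retrieved["title"].lower()
    ).ratio()
    claimed_authors = {a.lower() for a in claimed["authors"]}
    retrieved_authors = {a.lower() for a in retrieved["authors"]}
    author_overlap = len(claimed_authors & retrieved_authors) / len(
        claimed_authors | retrieved_authors
    )
    year_match = 1.0 if claimed["year"] == retrieved["year"] else 0.0
    return 0.5 * title_sim + 0.3 * author_overlap + 0.2 * year_match

record = {"title": "Attention Is All You Need",
          "authors": ["Vaswani", "Shazeer"], "year": 2017}
print(evidence_confidence(record, None))    # 0.0 -- nothing resolved
print(evidence_confidence(record, record))  # 1.0 -- perfect catalog match
```

The design point is the `None` branch: a citation with no retrievable record gets zero confidence regardless of how assertively the model presented it.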
⚠️ Section takeaway
Self-reported LLM confidence is unreliable for citation integrity. Robust pipelines must separate generation from confidence estimation and base the latter on verifiable external signals.
6. Regulation, Incentives, and a Safer Future for Citations
Ghost citations threaten more than tidy bibliographies. Reliable references underpin trust in scientific claims, policy, and regulation. When citations are unreliable, the epistemic base of decision-making erodes. [4][11]
Work on foundation model regulation frames an “innovation trilemma”: regulators can at most optimize two of: [10]
- Promoting innovation.
- Mitigating systemic risk.
- Providing clear rules.
Innovation is treated as politically non-negotiable, leaving a trade-off between systemic risk mitigation and clarity. Citation integrity is clearly a systemic-risk issue: if models used for guidelines or policy briefs fabricate literature at nontrivial rates, any system relying on citation-based trust is exposed. [2][11]
💼 Policy levers for citation integrity
Possible governance measures include:
- Mandatory AI disclosure in manuscripts, specifying whether and how LLMs were used in drafting and reference generation.
- Automated citation validation at submission for major venues, flagging references that fail to resolve or whose metadata does not match catalog records.
- Citation audits in research assessment, where funders and tenure committees sample recent work for citation validity, not just counts. [11]
Conferences and journals should also adopt explicit AI and citation policies, including: [4][5][12]
- Standards for when AI-generated citations are allowed.
- Author attestations that all references have been verified.
- Sanctions or extra scrutiny for repeated submission of papers with hallucinated citations.
- Reviewer training on common LLM citation failure modes (subtle title shifts, implausible venue-year combos, generic placeholder authors).
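Some of those failure modes are mechanically detectable, which is what reviewer tooling could target. A toy sketch flagging two of them, an implausible venue-year combination and placeholder author names; the venue launch years and the name list are illustrative stand-ins, not data from the cited sources:

```python
# Illustrative venue launch years; a real checker would query a venue database.
VENUE_FIRST_YEAR = {"NeurIPS": 1987, "ICLR": 2013, "ACL": 1962}
PLACEHOLDER_AUTHORS = {"smith, j.", "doe, a.", "doe, j."}

def citation_red_flags(venue: str, year: int, authors: list) -> list:
    """Return human-readable warnings for a single reference entry."""
    flags = []
    first = VENUE_FIRST_YEAR.get(venue)
    if first is not None and year < first:
        flags.append(f"{venue} did not exist in {year}")
    if any(a.lower() in PLACEHOLDER_AUTHORS for a in authors):
        flags.append("placeholder-looking author name")
    return flags

print(citation_red_flags("ICLR", 2009, ["Smith, J."]))
# ['ICLR did not exist in 2009', 'placeholder-looking author name']
print(citation_red_flags("ACL", 2020, ["Vaswani, A."]))  # []
```

Heuristics like these cannot prove fabrication, but surfacing them during review would focus scarce reviewer attention on the references most likely to be ghosts.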
💡 Aligning tools, workflows, and rules
Without incentives that reward verification, even excellent tools will be underused. Editorial and funding structures must value citation integrity alongside novelty and volume. [1][2][11]
⚠️ Section takeaway
Citation integrity should be treated as core research infrastructure, like data management and ethics review. Aligning regulation, incentives, and tooling is essential to contain LLM-driven fabrication.
Conclusion
LLMs invent citations because they are pattern completers, not fact retrievers. In an overstretched scholarly ecosystem with weak reference-checking norms, these fabrications quickly acquire the appearance of real literature. [1][2][11]
Mitigation requires:
- Grounded generation on curated corpora.
- Automated, large-scale citation verification.
- Evidence-based confidence signals.
- Institutional reforms that treat citation validity as non-negotiable.
Researchers, editors, and tool builders should now:
- Audit how they use LLMs in writing and review.
- Pilot citation-verification pipelines.
- Push venues and funders to adopt explicit AI and citation-integrity policies.
Acting before ghost references become normalized is crucial to preserving trust in the scientific record.
Sources & References (10)
1. Why Ghost References Still Haunt Us in 2025—And Why It's Not Just About LLMs
2. The Fabrication Problem: How AI Models Generate Fake Citations, URLs, and References (Nayeem Islam, Jun 12, 2025)
3. LLMs Generate Fake Citations in Academic Papers
4. AI Chatbots Are Poisoning Research Archives With Fake Citations (Miles Klee, December 17, 2025)
5. NeurIPS research papers contained 100+ AI-hallucinated citations, new report claims (Sharon Goldman, January 21, 2026)
6. Fact-checking AI-generated news reports: Can LLMs catch their own lies? (Jiayi Yao, Haibo Sun, Nianwen Xue, 24 Mar 2025)
7. Fact-checking AI-generated news reports: Can LLMs catch their own lies? (abstract page for the same paper)
8. Why Confidence Scoring With LLMs Is Dangerous
9. Rethinking Confidence in LLM Data Extraction
10. Regulation Priorities for Artificial Intelligence Foundation Models (Matthew R. Gaske)