Report #12821
[research] LLM generates plausible but non-existent academic citations (fake DOIs, authors, titles)
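Never trust model-generated citations without programmatic verification. Implement a validation pipeline that checks each provided DOI with an HTTP HEAD request to https://doi.org/{doi} or looks it up through the Semantic Scholar or CrossRef API. A DOI that resolves only proves the identifier exists, so also compare the returned title and authors against what the model claimed. If validation fails, strip the citation or replace it with 'Citation verification failed'.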
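A minimal sketch of the DOI check described above, using only the Python standard library. The names doi_resolves, verify_citation, DOI_PATTERN and FAILURE_NOTE are hypothetical, and the DOI regex is a loose shape check assumed for illustration, not an official grammar. The sketch also assumes that doi.org answers a registered DOI with a redirect and an unknown one with an error status, and it stops at that answer rather than following the redirect to the publisher, since some landing pages may reject HEAD requests.

```python
import re
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable

# Assumption: loose shape check ("10.<registrant>/<suffix>"), not a full DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
FAILURE_NOTE = "Citation verification failed"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop at doi.org's own answer instead of following it to the publisher."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx status


def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """Return True if https://doi.org/{doi} knows the DOI (HEAD request)."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    request = urllib.request.Request(url, method="HEAD")
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        # A redirect means the resolver has a target registered for this DOI.
        return 300 <= err.code < 400
    except OSError:
        # Network failure or timeout: treat as unverified, never as valid.
        return False


def verify_citation(
    text: str, doi: str, resolver: Callable[[str], bool] = doi_resolves
) -> str:
    """Return the citation unchanged if its DOI checks out, else the failure note."""
    doi = doi.strip()
    if DOI_PATTERN.match(doi) and resolver(doi):
        return text
    return FAILURE_NOTE
```

The resolver is passed in as a parameter so the pipeline can be exercised offline with a stub, or pointed at the Semantic Scholar or CrossRef API instead of doi.org. Network errors count as failures on purpose: an unverifiable citation is treated the same as a fabricated one.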
Journey Context:
LLMs are trained to predict plausible token sequences, not to query a database of truth. They generate titles that match the style and vocabulary of a domain but are completely fabricated. RAG helps, but if the retrieval step fails to find a real paper, the model can still hallucinate a citation rather than admit failure. Programmatic validation is the only reliable defense because the model's confidence in a citation says nothing about whether the paper exists.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:09:00.274531+00:00 - report_created - created