Report #62094
[research] LLM generates plausible but fabricated academic citations and DOIs
Never trust model-generated URLs, DOIs, or citation metadata without external validation. Implement a strict retrieve-then-generate pipeline in which citations are fetched from a verified database (e.g., the Semantic Scholar API) and injected into the prompt, rather than relying on the model's parametric memory.
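A minimal sketch of the retrieval-first step, assuming the database lookup is supplied as a callable. The `Paper` record shape, `build_grounded_prompt`, and `fake_search` are hypothetical names introduced for illustration; a real deployment would replace `fake_search` with a client for the chosen citation database.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Paper:
    """One verified record from the retrieval step (hypothetical shape)."""
    title: str
    authors: str
    year: int
    doi: str


def build_grounded_prompt(question: str, search: Callable[[str], List[Paper]]) -> str:
    """Retrieve first, then build a prompt listing the only citable sources."""
    papers = search(question)
    if not papers:
        # No verified sources: tell the model not to cite anything at all.
        return (
            f"Question: {question}\n"
            "No verified sources were found. "
            "Answer without citing any paper, DOI, or URL."
        )
    lines = [
        f"[{i}] {p.authors} ({p.year}). {p.title}. DOI: {p.doi}"
        for i, p in enumerate(papers, start=1)
    ]
    return (
        f"Question: {question}\n"
        "Cite ONLY the sources listed below, using their bracketed numbers. "
        "Do not add any other reference, DOI, or URL.\n"
        "Sources:\n" + "\n".join(lines)
    )


def fake_search(query: str) -> List[Paper]:
    # Stand-in for a real database lookup. The record is a placeholder,
    # not a real paper or a real DOI.
    return [Paper("Example Title", "A. Author", 2020, "10.0000/example")]


if __name__ == "__main__":
    print(build_grounded_prompt("What is known about topic X?", fake_search))
```

Passing the search function in as a parameter keeps the prompt builder deterministic and lets the retrieval backend be swapped or mocked without touching the prompt logic.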
Journey Context:
LLMs are trained to predict plausible token sequences, which makes them very good at producing citations that are syntactically correct but do not exist (e.g., real authors + real journals + fake titles). Post-generation filtering based on plausibility is insufficient because the fabricated metadata looks valid. The only reliable fix is to delegate citation retrieval to a deterministic search tool and constrain the LLM to cite only what the tool returns.
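One way to enforce "cite only what the tool returns" is an allowlist check on the model's output, sketched below. This is not a plausibility filter: it compares every DOI-shaped string in the output against the exact set the retrieval tool returned. The function name `unverified_dois` and the regular expression are illustrative assumptions, and the pattern is a loose approximation of DOI syntax rather than a full validator.

```python
import re
from typing import Iterable, Set

# Loose DOI-shaped pattern (assumption): "10.", a 4-9 digit registrant code,
# a slash, then any run of non-space characters.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def unverified_dois(model_output: str, allowed_dois: Iterable[str]) -> Set[str]:
    """Return DOIs found in the output that the retrieval tool did not supply."""
    # DOIs are case-insensitive, so compare in lowercase.
    allowed = {d.lower() for d in allowed_dois}
    found = {
        # Drop sentence punctuation that the pattern may have swallowed.
        match.rstrip(".,;)").lower()
        for match in DOI_PATTERN.findall(model_output)
    }
    return found - allowed


if __name__ == "__main__":
    # Placeholder DOIs, not real identifiers.
    output = "See 10.0000/example and also 10.9999/made-up."
    rejected = unverified_dois(output, ["10.0000/EXAMPLE"])
    if rejected:
        print("Reject or regenerate; unverified DOIs:", sorted(rejected))
```

A non-empty result means the answer cited something outside the retrieved set, so the response can be rejected or regenerated instead of shown to the user.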
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:42:49.473819+00:00— report_created — created