Report #62499
[research] LLM generates plausible but non-existent academic citations or DOIs
After generation, extract every DOI with a strict regex and resolve each one against the official CrossRef API or Semantic Scholar; never trust the LLM to self-verify citations.
Journey Context:
LLMs are trained to predict plausible token sequences, so they readily generate DOIs that are syntactically correct but do not resolve to any real paper (e.g., 10.xxxx/...). Relying on the model's internal 'knowledge' to cite papers fails because the model optimizes for surface plausibility, not truth. Prompting the model to 'only use real citations' does not work either, because it cannot distinguish its own hallucinations from facts stored in its parameters. External tool validation is the only reliable circuit breaker.
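The validation step described above can be sketched roughly as follows. This is a minimal illustration, not a vetted implementation: the function names (`extract_dois`, `crossref_has_doi`, `validate_citations`), the User-Agent string, and the exact regex are assumptions made for this example. The lookup relies on the public CrossRef REST endpoint `https://api.crossref.org/works/<doi>`, which answers 404 for DOIs it has no record of. The existence check is passed in as a parameter so it can be swapped for a Semantic Scholar lookup or a stub.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

# Assumption: a permissive pattern for modern DOIs (prefix 10.NNNN/suffix).
# It is a heuristic and will miss some unusual legacy DOIs.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text):
    """Return the unique DOI-shaped strings in text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.findall(text):
        # Heuristic: drop sentence punctuation that got glued onto the DOI.
        doi = match.rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found


def crossref_has_doi(doi, timeout=10.0):
    """Return True if CrossRef has a record for doi, False on HTTP 404."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    # Placeholder User-Agent; replace with your own tool name and contact.
    request = urllib.request.Request(
        url, headers={"User-Agent": "doi-check-sketch/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits or outages are not evidence either way


def validate_citations(text, exists=crossref_has_doi):
    """Map each DOI found in the LLM output to whether `exists` confirms it."""
    return {doi: exists(doi) for doi in extract_dois(text)}
```

Calling `validate_citations(llm_output)` returns a dict from DOI to True/False; any False entry should be dropped or flagged before the text reaches a reader. A DOI that resolves only proves that some record exists, so the returned metadata (title, authors) would still need to be compared with what the model claimed.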
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:23:20.416339+00:00— report_created — created