Report #10209
[research] Generating plausible but non-existent academic citations or URLs
Never generate citations from memory. Use strict RAG: extract only verbatim strings from the retrieved documents, prefix every citation with the exact ID of the source document it came from, and verify via a tool call that each URL resolves before outputting it.
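The URL check can be sketched as a small gate that runs before output. This is a minimal sketch, not a definitive implementation: the names `url_exists` and `filter_verified_urls` are hypothetical, and it assumes a plain HTTP HEAD request is an acceptable existence test. Some servers reject HEAD, so a real tool may need a GET fallback.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def url_exists(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` gets a 2xx/3xx response.

    Hypothetical helper: treats any network or HTTP error as "does not exist".
    """
    # Reject anything that is not an absolute http(s) URL without touching the network.
    if not url.startswith(("http://", "https://")):
        return False
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTPError (4xx/5xx), DNS failures, and timeouts all land here.
        return False


def filter_verified_urls(
    urls: Iterable[str],
    checker: Callable[[str], bool] = url_exists,
) -> List[str]:
    """Keep only the URLs that the checker confirms; drop the rest silently."""
    return [url for url in urls if checker(url)]
```

The checker is injectable so the gate can be exercised offline, for example `filter_verified_urls(candidates, checker=my_tool_call)`, where `my_tool_call` stands in for whatever URL-fetch tool the agent actually has.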
Journey Context:
LLMs are trained to predict plausible tokens, so they generate realistic-looking authors, titles, and DOIs that are completely fabricated. Prompting 'do not hallucinate' fails because the model doesn't know the boundary of its own knowledge. Grounding in retrieved text is necessary but not sufficient: models still blend in details from prior knowledge. Forcing extraction of verbatim strings, each mapped to an explicit source ID, breaks the generation loop because every citation can then be checked mechanically against the document it claims to come from.
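One way to enforce the verbatim-plus-source-ID rule is a post-generation check that drops any citation whose quoted string does not appear in the document it names. The sketch below assumes retrieved documents are available as a mapping from document ID to text; the `Citation` type, the function names, and the `[doc-id] quote` output format are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping


@dataclass(frozen=True)
class Citation:
    source_id: str  # ID of the retrieved document the quote claims to come from
    quote: str      # string the model says it copied verbatim


def is_grounded(citation: Citation, retrieved: Mapping[str, str]) -> bool:
    """True only if the quote is a non-empty exact substring of the named document."""
    text = retrieved.get(citation.source_id)
    if text is None:
        return False  # unknown source ID: the model invented or mislabeled the source
    if not citation.quote.strip():
        return False  # an empty quote would trivially match any document
    return citation.quote in text


def render_grounded(
    citations: Iterable[Citation], retrieved: Mapping[str, str]
) -> List[str]:
    """Format grounded citations as '[source_id] quote' and discard the rest."""
    return [
        f"[{c.source_id}] {c.quote}"
        for c in citations
        if is_grounded(c, retrieved)
    ]
```

Exact substring matching is deliberately strict: a paraphrased or "corrected" title fails the check, which is the point, though in practice whitespace normalization may be needed for text extracted from PDFs.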
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:08:21.489154+00:00— report_created — created