Report #88935
[counterintuitive] Why does the model hallucinate facts and citations despite explicit instructions to be accurate and only use provided information?
Treat hallucination as an inherent property of next-token prediction, not a correctable behavior. Design systems that verify claims externally: cross-reference LLM output against retrieval results, validate citations programmatically, and use retrieval-augmented generation with strict source attribution. Never trust an LLM-generated citation without external verification.
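One way to validate citations programmatically is sketched below. It assumes the model has been asked to cite retrieved passages by ID and to quote its supporting text verbatim. The `Citation` type and the `validate_citations` helper are hypothetical names used for illustration, not part of any particular library. A citation is rejected if its source ID was never retrieved or if its quote does not appear in that source.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical structure: the ID of a retrieved passage plus the text the model claims it says.
    doc_id: str
    quote: str


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't cause false rejections.
    return " ".join(text.lower().split())


def validate_citations(citations: list[Citation], sources: dict[str, str]) -> list[tuple[Citation, str]]:
    """Return (citation, reason) pairs for every citation that fails verification."""
    failures = []
    for c in citations:
        source_text = sources.get(c.doc_id)
        if source_text is None:
            # The model cited a document that was never retrieved: likely fabricated.
            failures.append((c, "unknown source id"))
        elif _normalize(c.quote) not in _normalize(source_text):
            # The source exists, but it does not contain the quoted text.
            failures.append((c, "quote not found in source"))
    return failures
```

Exact substring matching is deliberately strict. It catches invented sources and altered quotes, but it cannot judge whether a paraphrase is faithful, so any citation it flags should be treated as unverified rather than silently accepted.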
Journey Context:
Developers add instructions such as 'only use the provided information', 'if you're not sure, say so', or 'do not fabricate citations', expecting them to eliminate hallucination. They do not, because hallucination is not a choice the model makes and can be instructed out of; it is the model's default mode of operation. An LLM generates the most statistically plausible continuation of its context. When it produces a plausible-looking citation or fact, it is doing exactly what it was trained to do: predicting likely next tokens. The model has no built-in mechanism to distinguish 'I retrieved this from a reliable source' from 'this pattern is statistically likely to follow'. Instructions to 'be accurate' shift the probability distribution slightly, but they cannot draw a hard boundary between recalled facts and plausible inventions, because both are produced by the same mechanism: next-token prediction. The only reliable approach is external verification: treat LLM output as a draft that must be validated programmatically against ground truth. RAG helps by grounding generation in retrieved text, but even with RAG the model may generate claims that the retrieved documents do not support.
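Because RAG does not guarantee that every generated claim is grounded, the answer can be checked against the retrieved passages after generation. The sketch below is a deliberately crude lexical heuristic, not a standard algorithm. It flags any answer sentence whose content words are mostly absent from every retrieved passage. The `unsupported_sentences` function, the tiny stopword list, and the 0.8 threshold are all illustrative assumptions. A stronger check would score each sentence with an entailment (NLI) model instead of word overlap, and that approach is not shown here.

```python
import re

# Illustrative stopword list (an assumption); a real system would use a fuller list or a different scorer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "were", "to", "and", "by", "for", "it", "that", "this"}


def content_tokens(text: str) -> set[str]:
    # Lowercase alphanumeric word tokens, minus stopwords.
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def unsupported_sentences(answer: str, passages: list[str], threshold: float = 0.8) -> list[str]:
    """Flag answer sentences whose content tokens are not mostly covered by any single passage."""
    passage_tokens = [content_tokens(p) for p in passages]
    flagged = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        tokens = content_tokens(sentence)
        if not tokens:
            continue
        # Best coverage by any one passage; with no passages at all, coverage is 0.
        best = max((len(tokens & pt) / len(tokens) for pt in passage_tokens), default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged
```

For example, given a single passage "The Eiffel Tower was completed in 1889 for the World's Fair." and the answer "The Eiffel Tower was completed in 1889. It was designed by Leonardo da Vinci.", only the second sentence is flagged. Word overlap misses paraphrased support and can be fooled by claims that reuse the source's words, so it works best as a cheap first filter before human or model-based review.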
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:51:59.378560+00:00— report_created — created