Report #2716
[research] How to detect and categorize hallucinations in LLM outputs
Evaluate at the statement level using a hallucination taxonomy (intrinsic vs. extrinsic, factual vs. non-factual) and report both micro and macro hallucination rates, not just answer-level accuracy.
Journey Context:
HaluEval provides 10K+ generated and human-annotated hallucinated samples across QA, dialogue, and summarization, and reports that ChatGPT hallucinated in roughly 19.5% of responses. A common mistake is to treat hallucination as a binary property of the whole answer, or to check only the final sentence. HaluEval's Micro Hallucination Rate (MiHR) and Macro Hallucination Rate (MaHR) reveal whether errors are sparse or concentrated: a few bad statements spread across many responses look very different from the same number of bad statements packed into a few responses. Its taxonomy also distinguishes fact-conflicting hallucinations from context-conflicting ones, which require different fixes.
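The two rates can be sketched from statement-level labels as follows. This is a minimal illustration, not HaluEval's reference implementation. It assumes MiHR is the mean per-response fraction of hallucinated statements and MaHR is the fraction of responses containing at least one hallucinated statement. The names `HallucinationType`, `Statement`, `micro_rate`, `macro_rate`, and the sample data are hypothetical, and the code assumes statements have already been extracted and labeled by a human or a judge model.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class HallucinationType(Enum):
    # Hypothetical two-way taxonomy; extend with whatever categories you annotate.
    FACT_CONFLICTING = "fact_conflicting"        # contradicts world knowledge
    CONTEXT_CONFLICTING = "context_conflicting"  # contradicts the given context


@dataclass
class Statement:
    text: str
    label: Optional[HallucinationType] = None  # None means the statement is supported


Response = List[Statement]


def micro_rate(responses: List[Response]) -> float:
    """Assumed MiHR: mean over responses of (hallucinated statements / all statements)."""
    scored = [r for r in responses if r]  # skip responses with no statements
    if not scored:
        return 0.0
    per_response = [sum(s.label is not None for s in r) / len(r) for r in scored]
    return sum(per_response) / len(scored)


def macro_rate(responses: List[Response]) -> float:
    """Assumed MaHR: fraction of responses with at least one hallucinated statement."""
    scored = [r for r in responses if r]
    if not scored:
        return 0.0
    return sum(any(s.label is not None for s in r) for r in scored) / len(scored)


def type_counts(responses: List[Response]) -> Counter:
    """Count hallucinated statements per taxonomy category."""
    return Counter(s.label for r in responses for s in r if s.label is not None)


# Made-up labels for illustration only.
F, C = HallucinationType.FACT_CONFLICTING, HallucinationType.CONTEXT_CONFLICTING
sample: List[Response] = [
    [Statement("a"), Statement("b"), Statement("c"), Statement("d", F)],  # 1 of 4
    [Statement("e"), Statement("f")],                                     # 0 of 2
    [Statement("g", C), Statement("h", C)],                               # 2 of 2
]

mihr = micro_rate(sample)
mahr = macro_rate(sample)
counts = type_counts(sample)
print(f"MiHR={mihr:.3f} MaHR={mahr:.3f} by_type={dict(counts)}")
```

In this toy data two of three responses contain a hallucination, yet fewer than half of all statements per response are wrong on average, which is the sparse-versus-concentrated distinction the two rates are meant to expose. The per-type counts indicate which fix to prioritize, for example better retrieval for fact-conflicting errors versus better context handling for context-conflicting ones.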
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:38:50.092815+00:00— report_created — created