Report #3189
[research] Standard QA and generation benchmarks overstate factual reliability; models silently fabricate unverifiable content and are poor at recognizing it.
Evaluate hallucination generation and detection separately. Use HaluEval-style tests that include both normal and hallucinated samples, and report recognition accuracy per task. Add retrieval or reasoning steps to improve recognition, but measure the cost: reasoning alone can sometimes hurt detection, while external knowledge helps most on factuality-heavy tasks.
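A minimal sketch of per-task recognition scoring follows. It assumes each test item carries a task name and a gold hallucinated/normal label. `Sample`, `recognition_accuracy`, and the stub detector are hypothetical names chosen for illustration; they are not part of HaluEval's own tooling.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    task: str              # e.g. "qa" or "dialogue" (illustrative labels)
    text: str              # the response to be judged
    is_hallucinated: bool  # gold label


def recognition_accuracy(
    samples: Iterable[Sample],
    detector: Callable[[Sample], bool],
) -> Dict[str, float]:
    """Return, per task, the fraction of samples the detector labels correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s.task] += 1
        if detector(s) == s.is_hallucinated:
            correct[s.task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy, hand-written data (not drawn from any benchmark): one normal and one
# hallucinated sample per task, mirroring the balanced design described above.
samples = [
    Sample("qa", "Paris is the capital of France.", False),
    Sample("qa", "Paris is the capital of Spain.", True),
    Sample("dialogue", "I have not seen that film, so I cannot say.", False),
    Sample("dialogue", "That film won twelve Oscars in 1802.", True),
]


def never_flags(sample: Sample) -> bool:
    """Stub detector that never reports a hallucination."""
    return False


scores = recognition_accuracy(samples, never_flags)
print(scores)
```

Because the toy set is balanced, a detector that never flags anything scores 0.5 on every task. That is why including both normal and hallucinated samples matters: a trivial detector cannot look good. To compare interventions, run the same function once per detector variant (plain, with retrieval, with reasoning) and compare the resulting dictionaries task by task.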
Journey Context:
HaluEval found that ChatGPT fabricates unverifiable information in about 19.5% of general responses and reaches only 62.59% accuracy at detecting hallucinated QA answers. The benchmark also shows that providing retrieved knowledge raises QA detection accuracy from 62.59% to 76.83%, while naïve chain-of-thought reasoning can degrade it. Detection is therefore not a single-number problem; it depends on both the task and the intervention.
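The size of the retrieval effect quoted above can be worked out directly. The short sketch below uses only the two QA accuracies from this report; the variable names are illustrative.

```python
# QA detection accuracies quoted in this report, in percent.
baseline_acc = 62.59        # ChatGPT without extra help
with_knowledge_acc = 76.83  # ChatGPT given retrieved knowledge

# Absolute gain in percentage points.
abs_gain = round(with_knowledge_acc - baseline_acc, 2)

# Relative gain as a percentage of the baseline accuracy.
rel_gain = round(100 * (with_knowledge_acc - baseline_acc) / baseline_acc, 1)

print(f"absolute gain: {abs_gain} points, relative gain: {rel_gain}%")
```

Reporting the absolute gain in percentage points alongside the relative gain keeps the two from being confused when results for several tasks or interventions sit side by side.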
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:39:44.723130+00:00— report_created — created