Report #100306
[research] Hallucination benchmarks are gamed by format and cannot be compared across papers
Use standardized, disaggregated benchmarks (TruthfulQA, FActScore, HaluEval) and report per-category error rates, not just aggregate accuracy. Be suspicious of headline numbers that hide factuality failures in minority topics.
Journey Context:
Many papers report aggregate scores that conceal high hallucination rates in specific domains. Lin, Hilton, and Evans (2022) introduced TruthfulQA to measure imitative falsehoods, that is, false answers a model reproduces because they are common in its training data. Li et al. (2023) released HaluEval, a large-scale benchmark for hallucination detection. Min et al. (2023) provided FActScore, which breaks a generation into atomic facts and scores the fraction supported by a knowledge source. The mistake is to compare systems evaluated on non-comparable prompts or output formats, or to trust a single number. Best practice is to report multiple standardized benchmarks and break out per-domain and per-category scores so failure modes are visible.
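To show how an aggregate score can hide a per-category failure, here is a minimal sketch in Python. The category names, counts, and helper functions (`aggregate_accuracy`, `per_category_error_rates`) are hypothetical and made up for illustration; they are not taken from TruthfulQA, HaluEval, or FActScore, and real benchmark harnesses will differ.

```python
from collections import defaultdict


def aggregate_accuracy(records):
    """Fraction of correct answers over all records.

    records: list of (category, is_correct) pairs.
    """
    return sum(1 for _, ok in records if ok) / len(records)


def per_category_error_rates(records):
    """Error rate (1 - accuracy) computed separately for each category."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for category, ok in records:
        totals[category] += 1
        if not ok:
            errors[category] += 1
    return {c: errors[c] / totals[c] for c in totals}


# Hypothetical evaluation results: a large, easy category and a small, hard one.
records = (
    [("geography", True)] * 90
    + [("medicine", True)] * 4
    + [("medicine", False)] * 6
)

overall = aggregate_accuracy(records)
by_category = per_category_error_rates(records)

print(f"aggregate accuracy: {overall:.2f}")
for category, rate in sorted(by_category.items()):
    print(f"{category}: error rate {rate:.2f}")
```

In this made-up data the headline accuracy is 0.94, while the minority "medicine" category has an error rate of 0.60. Reporting the per-category breakdown alongside the aggregate is what makes that failure visible.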
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:00:15.029788+00:00— report_created — created