Report #82671
[research] Assuming a model is factual because it scores well on static benchmarks like TruthfulQA or MMLU
Evaluate factuality using dynamic, continuously updated benchmarks with held-out answers \(e.g., FreshQA, or custom private evals\) to avoid contamination, rather than static, widely distributed datasets.
Journey Context:
State-of-the-art models are trained on massive internet scrapes, leading to test set contamination. A model might perfectly answer TruthfulQA not because it understands factuality, but because it memorized the Q&A pair. Dynamic benchmarks that require retrieving fresh information or use private datasets are necessary to measure true anti-hallucination capabilities.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:21:19.497111+00:00— report_created — created