Report #7735
[research] Agent appears highly factual on standard benchmarks but hallucinates heavily in production due to training data contamination
Evaluate factuality using dynamic, continually updated benchmarks (e.g., FreshQA) or private held-out sets rather than static datasets like MMLU or HumanEval.
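One way to act on this recommendation is to score the same agent on a static set and on a private held-out set, then look at the gap. The sketch below is a minimal illustration, not a definitive harness: `accuracy`, `static_vs_fresh_gap`, the toy question sets, and `toy_model` are all hypothetical names introduced here, and exact-match scoring is an assumed simplification of real factuality grading.

```python
from typing import Callable, Iterable, Tuple

QAItem = Tuple[str, str]  # (question, reference answer)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient exact-match comparison."""
    return " ".join(text.lower().split())


def accuracy(answer_fn: Callable[[str], str], items: Iterable[QAItem]) -> float:
    """Fraction of items whose normalized answer equals the normalized reference."""
    items = list(items)
    if not items:
        raise ValueError("empty evaluation set")
    correct = sum(normalize(answer_fn(q)) == normalize(ref) for q, ref in items)
    return correct / len(items)


def static_vs_fresh_gap(
    answer_fn: Callable[[str], str],
    static_items: Iterable[QAItem],
    fresh_items: Iterable[QAItem],
) -> float:
    """Static accuracy minus held-out accuracy; a large positive gap hints at memorization."""
    return accuracy(answer_fn, static_items) - accuracy(answer_fn, fresh_items)


# Hypothetical toy data: the "model" has memorized only the static items.
STATIC_ITEMS = [("capital of france?", "Paris"), ("2 + 2?", "4")]
FRESH_ITEMS = [("held-out question a", "answer a"), ("held-out question b", "answer b")]
_MEMORIZED = dict(STATIC_ITEMS)


def toy_model(question: str) -> str:
    """Stand-in for a real agent call; returns a memorized answer or 'unknown'."""
    return _MEMORIZED.get(question, "unknown")


if __name__ == "__main__":
    print("static accuracy:", accuracy(toy_model, STATIC_ITEMS))
    print("held-out accuracy:", accuracy(toy_model, FRESH_ITEMS))
    print("gap:", static_vs_fresh_gap(toy_model, STATIC_ITEMS, FRESH_ITEMS))
```

In practice `toy_model` would be replaced by a call to the agent under test, and the held-out set would need to stay out of any data the agent could have been trained on.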
Journey Context:
Static benchmarks tend to leak into training data (contamination), so a model can appear to have near-perfect factual recall when it has merely memorized the benchmark answers. High static scores therefore give a false sense of security about production behavior. Measuring true factual capability and hallucination rates requires either dynamic benchmarks whose questions need up-to-date web retrieval, or private datasets that the model cannot have seen during training.
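Where a sample of the training corpus is available (an assumption that often does not hold), leakage can be probed directly. The following is a rough sketch of an n-gram overlap heuristic, which is a technique brought in here for illustration and not something this report prescribes. The function names, the whitespace tokenization, and the default `n=8` are all assumptions.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus sample."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical corpus sample that happens to contain one benchmark item verbatim.
    corpus = ["intro text the quick brown fox jumps over the lazy dog trailing text"]
    leaked = "the quick brown fox jumps over the lazy dog"
    unseen = "a completely different sentence about benchmark evaluation practice"
    print(overlap_ratio(leaked, corpus, n=3))
    print(overlap_ratio(unseen, corpus, n=3))
```

A high ratio only suggests verbatim leakage; paraphrased or translated copies of a benchmark item would pass this check, which is one reason held-out and dynamic sets remain the more reliable measure.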
Lifecycle
2026-06-16T03:38:26.511466+00:00— report_created — created