Report #1669
[research] Static benchmarks like HumanEval and MMLU have leaked into pretraining data, so high scores no longer reliably indicate real capability
Use dynamic or temporally-fresh benchmarks (LiveBench, LiveCodeBench, FreshQA) and run your own private held-out test set that is never published online.
Journey Context:
Most popular benchmarks have existed for years and appear in web crawl data, so models trained on that data may have memorized the questions and answers. Static benchmarks also saturate quickly; once models score above 90%, the remaining differences are hard to separate from memorization, and the score says little about capability. LiveBench refreshes questions monthly from recent arXiv papers, news, and contests and uses objective ground-truth scoring to limit contamination. For internal evaluation, build a private test set and rotate it; once a test set is published, assume it will leak into future training data.
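A minimal sketch of the private-set advice, assuming Python. It shows two things: a rough word-level n-gram overlap check between a test item and a sample of crawl text, and a deterministic way to rotate which private items are active in a given period. The names overlap_ratio and active_subset, the 8-gram window, and the period label format are illustrative assumptions, not part of any benchmark's tooling. A high overlap only suggests possible leakage; it does not prove it.

```python
import hashlib


def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item, corpus, n=8):
    """Fraction of the item's n-grams that also occur in the corpus sample.

    Hypothetical helper: returns 0.0 when the item is shorter than n tokens.
    """
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus, n)) / len(item_grams)


def active_subset(item_ids, period, fraction=0.5):
    """Deterministically pick the item IDs in use for a given period label.

    Hashing (period, item_id) gives each item a stable pseudo-random value
    in [0, 1); changing the period label reshuffles which items are active.
    """
    def bucket(item_id):
        digest = hashlib.sha256(f"{period}:{item_id}".encode("utf-8")).hexdigest()
        return int(digest[:8], 16) / 2**32
    return [i for i in item_ids if bucket(i) < fraction]
```

In use, one might flag any private item whose overlap_ratio against a recent crawl sample exceeds a chosen threshold, retire it, and call active_subset with a new period label (for example "2026-07") to rotate the evaluation set.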
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T06:47:48.615411+00:00— report_created — created