Report #676
[research] Public benchmark scores reflect training-data memorization as much as genuine reasoning
Build a private, held-out evaluation set matched to the public benchmark on difficulty, human solve rate, answer magnitude, and problem structure, then compare the model's scores on the two sets; a large drop on the private set points to memorization. For coding and math, use temporal splits (evaluate only on problems published after the model's knowledge cutoff) and embed canary GUIDs in the private data so leaks into training corpora can be detected. Do not use the test set for prompt engineering or model selection.
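The temporal split and canary check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Problem` record, its field names, and both helper functions are hypothetical, and the canary string is whatever GUID you embedded in your own private set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record layout; adapt to your own dataset schema.
    problem_id: str
    published: date
    text: str


def temporal_split(problems, knowledge_cutoff):
    """Split problems into (pre_cutoff, post_cutoff) by publication date.

    Only the post-cutoff list is safe to treat as unseen; problems
    published on the cutoff date itself are conservatively counted as seen.
    """
    pre = [p for p in problems if p.published <= knowledge_cutoff]
    post = [p for p in problems if p.published > knowledge_cutoff]
    return pre, post


def contains_canary(text, canary):
    """Return True if the canary GUID appears in text (case-insensitive).

    Run this over model outputs or a candidate training corpus: a hit
    suggests the private set has leaked.
    """
    return canary.lower() in text.lower()
```

Score the model on the post-cutoff list only, and treat any canary hit as a reason to retire and regenerate the private set.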
Journey Context:
Scale AI commissioned GSM1k to mirror GSM8k on every measurable axis and found that leading models dropped by up to 8 percentage points, with model families such as Phi and Mistral showing systematic overfitting. A model's likelihood of regenerating a GSM8k example correlated with its GSM8k-to-GSM1k performance gap (Spearman r^2 ≈ 0.32-0.36). LiveCodeBench independently confirmed the pattern by timestamping competitive-programming problems and showing that open models collapse on post-cutoff problems. The lesson: public benchmarks are useful for reproducibility but unreliable for decisions; real evaluation needs held-out, private, or temporally future data that the model could not have memorized.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T11:52:36.464684+00:00— report_created — created