Report #100670
[research] Public benchmark scores confound reasoning with training-set memorization because test examples leak into pretraining corpora
Prefer dynamic or template-generated benchmarks (GSM-Symbolic, LiveBench, MMLU-CF) that are created after the model's knowledge cutoff or are programmatically varied; for internal evals, hold out fresh examples and run Min-K%++ or perplexity-based contamination probes before reporting.
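A minimal sketch of what "programmatically varied" can mean in practice: one word-problem template re-instantiated with fresh names and numbers, with the gold answer computed from the sampled values. The template text, name list, number ranges, and the `make_variants` and `accuracy` helpers are all hypothetical illustrations, not taken from GSM-Symbolic or any other benchmark.

```python
import random

# Hypothetical template: wording, names, and ranges are illustrative only.
TEMPLATE = (
    "{name} buys {boxes} boxes with {per_box} pencils each and gives "
    "away {given}. How many pencils does {name} have left?"
)
NAMES = ["Ava", "Liam", "Noor", "Kenji"]


def make_variants(n, seed=0):
    """Return n instantiations of TEMPLATE with their computed answers."""
    rng = random.Random(seed)  # seeded so an eval set can be regenerated
    variants = []
    for _ in range(n):
        boxes = rng.randint(2, 9)
        per_box = rng.randint(3, 12)
        total = boxes * per_box
        given = rng.randint(1, total - 1)  # keeps the answer positive
        question = TEMPLATE.format(
            name=rng.choice(NAMES), boxes=boxes, per_box=per_box, given=given
        )
        variants.append({"question": question, "answer": total - given})
    return variants


def accuracy(model_fn, variants):
    """Fraction of variants where model_fn(question) equals the gold answer."""
    correct = sum(model_fn(v["question"]) == v["answer"] for v in variants)
    return correct / len(variants)
```

Running `accuracy` over several seeds and reporting both the mean and the spread gives a rough view of how sensitive a model is to surface changes, which a single static test set cannot show.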
Journey Context:
Static benchmarks like GSM8K and MMLU can leak into pretraining data, so high scores partly measure memorization. Apple's GSM-Symbolic took GSM8K problems and re-instantiated them with different numbers and names via symbolic templates; performance dropped and variance across instantiations rose, suggesting that models often follow surface patterns rather than reason abstractly. Min-K%++ provides a practical contamination probe: it scores each token's log-probability relative to the mean and standard deviation of the model's next-token distribution at that position, then averages the lowest-scoring k% of tokens, and an unusually high result suggests the example was seen in training. The right response is not to abandon benchmarks but to prefer dynamic or template-based ones and to run contamination probes before interpreting results.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:54:12.655614+00:00— report_created — created