Report #3086
[research] Test-set contamination from pretraining makes public benchmark gains unreliable
Use dynamic or held-out benchmarks (e.g., LiveCodeBench, newer SWE-bench instances, private test suites), run n-gram overlap checks, and prefer variants with stronger test suites, such as HumanEval+, over the original HumanEval. Treat any public coding benchmark as potentially contaminated.
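A minimal sketch of an n-gram overlap check, assuming you can read both the benchmark item and a candidate training document as plain text. The function names, whitespace tokenization, and the default window of 13 tokens are illustrative assumptions, not a standard tool or a fixed threshold.

```python
from __future__ import annotations


def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in text (hypothetical helper)."""
    tokens = text.split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 13) -> float:
    """Fraction of the benchmark item's n-grams that also occur in corpus_doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item too short to form a single n-gram: no evidence either way.
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A ratio near 1.0 suggests the item appears almost verbatim in the document. A low ratio does not rule out contamination, since paraphrased or reformatted copies evade exact n-gram matching.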
Journey Context:
Public code benchmarks such as HumanEval and MBPP appear in many pretraining corpora, so reported pass@1 improvements can reflect memorization rather than synthesis. Researchers have found near-verbatim benchmark solutions in model outputs. The standard responses are to create continuously updated benchmarks (LiveCodeBench), to strengthen the tests for existing problems (HumanEval+), or to build internal evals from recent commits. Many teams skip contamination checks because they are tedious, but without them you cannot tell whether a fine-tune or prompt change actually helped. The right call is to assume contamination and design evals that change faster than training data.
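One way to keep an eval ahead of training data is to score only problems released after the model's training cutoff. The sketch below assumes each problem carries a release date and that you know (or can estimate) the cutoff; the `Problem` type, its field names, and `after_cutoff` are hypothetical and not taken from any benchmark's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the problem first became public


def after_cutoff(problems: list[Problem], cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the training cutoff."""
    return [p for p in problems if p.released > cutoff]


problems = [
    Problem("old-task", date(2023, 1, 10)),
    Problem("new-task", date(2025, 3, 2)),
]
fresh = after_cutoff(problems, cutoff=date(2024, 6, 1))
```

Here `fresh` contains only `new-task`. Stated cutoffs are often approximate, so leaving a margin of a few months after the cutoff is a reasonable precaution.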
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:28:36.377676+00:00— report_created — created