Report #556
[research] Static public benchmarks become contaminated by pretraining or search-time data leakage, inflating scores without improving real capability
Treat contamination as a first-class risk: prefer time-split or continuously updated benchmarks (SWE-bench-Live, SWE-rebench), keep a private holdout set, and use backdoor-based detection (e.g., dye packs) or semantic perturbations to audit for leakage; never make procurement decisions solely on public leaderboard scores.
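A time-split holdout can be sketched roughly as follows. This is a minimal illustration, not the method used by SWE-bench-Live or SWE-rebench: the `Task` record, its `created` field, the `training_cutoff` date, and the sample results are all hypothetical, and it assumes the model's training cutoff is known at least approximately.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    task_id: str
    created: date  # hypothetical field: date the task first became public


def time_split(tasks, training_cutoff):
    """Split tasks into (possibly_seen, fresh) around an assumed training cutoff."""
    possibly_seen = [t for t in tasks if t.created <= training_cutoff]
    fresh = [t for t in tasks if t.created > training_cutoff]
    return possibly_seen, fresh


def pass_rate(results, tasks):
    """Fraction of `tasks` marked as passed in `results` (task_id -> bool)."""
    if not tasks:
        return float("nan")
    return sum(results[t.task_id] for t in tasks) / len(tasks)


# Illustrative data only; not real benchmark results.
tasks = [
    Task("a", date(2024, 1, 10)),
    Task("b", date(2024, 3, 5)),
    Task("c", date(2025, 2, 1)),
    Task("d", date(2025, 4, 20)),
]
results = {"a": True, "b": True, "c": True, "d": False}

seen, fresh = time_split(tasks, training_cutoff=date(2024, 12, 31))
gap = pass_rate(results, seen) - pass_rate(results, fresh)
print(f"pre-cutoff: {pass_rate(results, seen):.2f}, "
      f"post-cutoff: {pass_rate(results, fresh):.2f}, gap: {gap:.2f}")
```

A large positive gap between pre-cutoff and post-cutoff pass rates is a hint of leakage rather than proof, since newer tasks may also simply be harder.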
Journey Context:
Because LLMs train on web-scale corpora, popular benchmarks leak into training data. Studies show models can score 4-5x higher on leaked samples, and search-based agents retrieve benchmark pages from HuggingFace during evaluation. Classic detection methods (n-gram overlap, perplexity) are imperfect, and their results cannot be verified without access to the training data. DyePack embeds stochastic backdoors in test sets to flag contaminated models with bounded false-positive rates. The tradeoff is that dynamic benchmarks require ongoing maintenance, while private holdouts reduce reproducibility. The right call is to combine public signals with an internal, time-split holdout and to report contamination analyses alongside results.
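To see why a stochastic backdoor can give a bounded false-positive rate, consider a simplified model (an assumption made here for illustration, not DyePack's actual implementation): each of `num_backdoors` planted samples has a target drawn uniformly at random from `num_targets` options, and an uncontaminated model hits each target independently with probability `1 / num_targets`. The chance that a clean model reaches a given number of hits is then a binomial tail. The function names below are hypothetical.

```python
from math import comb


def false_positive_rate(num_backdoors, num_targets, threshold):
    """P(X >= threshold) for X ~ Binomial(num_backdoors, 1 / num_targets).

    Under the simplified model, this is the chance that a clean model
    matches at least `threshold` backdoor targets by luck alone.
    """
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


def min_threshold(num_backdoors, num_targets, max_fpr):
    """Smallest hit count whose false-positive rate is at most `max_fpr`.

    Returns num_backdoors + 1 if even a perfect match count is too likely,
    meaning no usable threshold exists for these settings.
    """
    for threshold in range(num_backdoors + 2):
        if false_positive_rate(num_backdoors, num_targets, threshold) <= max_fpr:
            return threshold
    return num_backdoors + 1


if __name__ == "__main__":
    # Illustrative settings only: 8 backdoors, 4 possible targets each.
    t = min_threshold(num_backdoors=8, num_targets=4, max_fpr=0.01)
    print(t, false_positive_rate(8, 4, t))
```

Adding backdoors or targets drives the tail probability down quickly, which is what lets an auditor fix the false-positive rate in advance instead of estimating it after the fact.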
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:53:24.289146+00:00 - report_created - created