Report #70684
[research] Teams ask 'is this benchmark contaminated?' instead of measuring how much score survives decontamination
Audit benchmarks for contamination with canary strings, n-gram deduplication, temporal splits, and slot-guessing tests. Prefer dynamic benchmarks such as LiveBench and LiveCodeBench, and maintain private holdout sets. Always report contamination estimates alongside scores.
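Of the audits listed above, n-gram overlap is the simplest to sketch. The snippet below is a minimal illustration, not a reference implementation: the function names `ngrams` and `flag_contaminated`, the whitespace tokenization, and the default window of 8 tokens are all assumptions chosen for the example. A real audit would typically use the model's tokenizer and an index over the training corpus rather than an in-memory set.

```python
def ngrams(text, n=8):
    """Return the set of lowercase, whitespace-tokenized n-grams in text.

    Hypothetical helper: tokenization and n=8 are assumptions for this sketch.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(benchmark_items, corpus_docs, n=8):
    """Flag each benchmark item that shares at least one n-gram with the corpus.

    Returns a list of booleans aligned with benchmark_items.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [bool(ngrams(item, n) & corpus_grams) for item in benchmark_items]
```

The fraction of flagged items gives a rough contamination estimate to report next to the score. Exact n-gram matching misses paraphrased or translated leaks, so it is best read as a lower bound.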
Journey Context:
Static public benchmarks published on the web should be treated as contaminated by default; even ~10% contamination can flip rankings. Detection methods include n-gram matching, canary insertion, membership inference, and temporal splits. The right question is not whether a benchmark is clean but how much performance drops under decontamination. SWE-rebench showed model-dependent drops when issues post-dated training cutoffs. Dynamic benchmarks mitigate contamination by refreshing questions, and private holdouts are the gold standard for product decisions.
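The "how much score survives" question can be made concrete with a temporal split. The sketch below assumes each result is a hypothetical `(created, passed)` pair, where `created` is the date the item was published and `passed` is whether the model solved it; the function name `decontamination_report` and the returned keys are illustrative, not taken from any benchmark's tooling. Items created after the model's training cutoff are treated as the clean subset.

```python
from datetime import date


def decontamination_report(results, cutoff):
    """Compare accuracy on all items with accuracy on post-cutoff items.

    results: iterable of (created: date, passed: bool) pairs (assumed format).
    cutoff:  the model's training cutoff date; items created strictly after
             it are treated as clean.
    """
    results = list(results)
    clean = [passed for created, passed in results if created > cutoff]
    if not results or not clean:
        raise ValueError("need at least one item overall and one clean item")

    full_acc = sum(passed for _, passed in results) / len(results)
    clean_acc = sum(clean) / len(clean)
    return {
        "full_accuracy": full_acc,
        "clean_accuracy": clean_acc,
        "drop": full_acc - clean_acc,  # positive means the score shrank
        "n_total": len(results),
        "n_clean": len(clean),
    }


# Usage with made-up data: two pre-cutoff items, two post-cutoff items.
example = [
    (date(2023, 1, 10), True),
    (date(2023, 6, 5), True),
    (date(2024, 9, 1), True),
    (date(2024, 11, 20), False),
]
report = decontamination_report(example, cutoff=date(2024, 1, 1))
```

Reporting `drop` and `n_clean` per model makes the comparison explicit, since the drop can differ between models with different cutoffs. With a small clean subset the estimate is noisy, so a confidence interval is worth adding before drawing ranking conclusions.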
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:13:18.369884+00:00— report_created — created