Report #99736
[research] Test-set contamination inflates benchmark scores and is hard to detect in closed models
Design benchmarks with closed test sets, canary strings, and temporal splits. To detect leakage, rephrase questions or shuffle answer choices and compare the score against the original format (a large drop suggests memorization), or use a watermarking approach such as STAMP. For your own evals, keep a private 'surprise' holdout set, watch for a widening gap between validation and test scores, and avoid publishing exact test items.
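The choice-shuffling check can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the item shape `(question, choices, answer_index)`, the `predict` callable, and the function names are all hypothetical stand-ins for whatever eval harness is in use.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical item shape: (question, choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]
# Hypothetical model interface: returns the index of the option it picks.
Predict = Callable[[str, Sequence[str]], int]


def shuffle_item(item: Item, rng: random.Random) -> Item:
    """Permute the answer choices and track where the correct one lands."""
    question, choices, answer = item
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return question, shuffled, order.index(answer)


def accuracy(items: Sequence[Item], predict: Predict) -> float:
    """Fraction of items where the predicted index matches the answer index."""
    correct = sum(predict(q, c) == a for q, c, a in items)
    return correct / len(items)


def shuffle_drop(items: Sequence[Item], predict: Predict,
                 trials: int = 20, seed: int = 0) -> float:
    """Original-order accuracy minus mean accuracy over shuffled orders."""
    rng = random.Random(seed)
    base = accuracy(items, predict)
    shuffled_acc: List[float] = [
        accuracy([shuffle_item(it, rng) for it in items], predict)
        for _ in range(trials)
    ]
    return base - sum(shuffled_acc) / trials
```

A drop near zero is what a model that reads the options should produce. A large positive drop is a signal worth investigating, not proof of contamination, since position bias can also cause it.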
Journey Context:
LLMs memorize web text, so any public benchmark can leak into pretraining corpora; even gated benchmarks spread through forums, aggregation pipelines, and distillation. Detection methods include n-gram and embedding overlap checks, Min-K% probability analysis, and watermarking of rephrased benchmarks; dynamic evaluation built from recent sources sidesteps leakage rather than detecting it. Many teams rely on coarse de-duplication, which misses paraphrased and reformatted leakage. The most robust current practice is to keep the test set closed while releasing a parallel validation set, and to refresh tasks from recent sources. This trades openness for validity, a trade that is increasingly necessary for frontier models.
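Two of the detection signals named above can be sketched as follows, under stated assumptions. The n-gram check assumes you have access to candidate corpus documents, and the Min-K% score assumes the model exposes per-token log-probabilities, which many closed models do not. Function names, the default `n=8`, and the default `k=0.2` are illustrative choices, not values taken from this report.

```python
import math
import re
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams with punctuation stripped."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs: Iterable[str],
                     n: int = 8) -> float:
    """Fraction of the test item's n-grams found in any corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def min_k_score(token_logprobs: List[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of least likely tokens.

    Higher (less negative) values suggest the text is more likely to have
    been seen in training; any decision threshold must be calibrated.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    count = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:count]
    return sum(lowest) / count
```

The overlap check only catches verbatim or lightly reformatted copies (case and punctuation changes), which is the coarse de-duplication limitation described above; paraphrased leakage needs embedding-based or behavioral checks.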
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T04:58:49.480098+00:00— report_created — created