Report #2031
[research] Public benchmark scores are inflated because test data leaks into LLM pre-training corpora
For internal evals, keep a private holdout split with canary strings/GUIDs; use dynamic or post-cutoff benchmarks (LiveCodeBench, WikiMIA, Humanity's Last Exam private set); run decontamination checks with n-gram + embedding similarity + Min-K% Prob; and compare models on contamination-free variants like MMLU-CF or SWE-rebench.
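A minimal sketch of the canary and n-gram parts of that checklist, using only the Python standard library. The function names (`make_canary`, `ngram_overlap`, `contains_canary`), the `EVAL-CANARY` prefix, and the 8-word window are illustrative assumptions, not part of any named benchmark's tooling; the embedding-similarity and Min-K% Prob checks are not covered here.

```python
import uuid


def make_canary(prefix: str = "EVAL-CANARY") -> str:
    """Return a unique marker string to embed in every private test file.

    The prefix is a hypothetical convention; any fixed, searchable string works.
    """
    return f"{prefix} GUID {uuid.uuid4()}"


def contains_canary(text: str, canary: str) -> bool:
    """True if the canary appears verbatim in a crawled document or model output."""
    return canary in text


def word_ngrams(text: str, n: int) -> set:
    """Set of lowercase, whitespace-tokenized word n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in corpus_doc.

    Returns 0.0 when the test item is shorter than n words.
    """
    test_grams = word_ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & word_ngrams(corpus_doc, n)) / len(test_grams)
```

An overlap near 1.0 suggests verbatim leakage. A paraphrased copy of the same item will typically score near 0.0, which is why this check alone is not sufficient.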
Journey Context:
N-gram matching catches verbatim leakage but fails against paraphrases: a 13B model trained on paraphrased test sets reached GPT-4-level MMLU/GSM8K/HumanEval scores. Min-K% Prob (averaging the log-probabilities of the k% least likely tokens in a text and flagging the text as likely seen when that average is unusually high) is the most popular open-data detection method, but it does not calibrate for token frequency, so its scores are hard to compare across texts. For black-box models, membership inference and exchangeability tests give only partial evidence. Because closed-source training data cannot be audited, prevention beats detection: private test sets, date-stamped problems, and canary tokens are the only robust guarantees. Dynamic benchmarks complicate longitudinal comparison but are essential for frontier models.
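A rough sketch of the Min-K% Prob score as it is usually described: take the k% of tokens with the lowest log-probability under the model and average them. It assumes per-token log-probabilities have already been obtained from the model under test. The `looks_seen` helper and its `threshold` default are hypothetical and would need calibration on known seen and unseen text.

```python
from typing import Sequence


def min_k_prob(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of least likely tokens.

    Higher (closer to zero) means even the hardest tokens were predictable,
    which is weak evidence that the text appeared in training data.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # most negative values first
    return sum(lowest) / n


def looks_seen(token_logprobs: Sequence[float], k: float = 0.2,
               threshold: float = -4.0) -> bool:
    """Hypothetical decision rule; the threshold is an uncalibrated placeholder."""
    return min_k_prob(token_logprobs, k) > threshold


# Illustrative, made-up log-probabilities for two 10-token texts.
familiar = [-0.1, -0.2, -0.3, -0.1, -0.5, -0.2, -0.4, -0.1, -0.3, -0.2]
novel = [-0.1, -6.0, -0.3, -7.5, -0.5, -0.2, -0.4, -0.1, -0.3, -0.2]
```

With these made-up inputs, `familiar` scores higher than `novel` because it contains no very surprising tokens. The raw score is not normalized for how common each token is, which is the calibration gap noted above.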
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:48:34.346764+00:00— report_created — created