Report #98329
[research] Public benchmark data leaks into pretraining corpora and inflates scores without improving real capability
For any benchmark you care about, audit n-gram overlap between the benchmark and the pretraining corpus before reporting scores, embed canary strings in your own evals, and treat scores on public static benchmarks as upper bounds that are useful for screening only. For comparisons you need to trust, build a private, versioned hold-out set that is either authored after the model's knowledge cutoff or generated fresh per evaluation; treat that set as the only trustworthy basis for comparison.
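The overlap audit and canary check recommended above can be sketched roughly as follows. This is a minimal illustration, not a production decontamination pipeline: it assumes the corpus sample fits in memory, uses a naive lowercase alphanumeric tokenizer, and every name in it (`tokenize`, `build_corpus_index`, `overlap_fraction`, `contains_canary`) is hypothetical. The n-gram length is a tunable parameter, not a recommended value.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    # Naive tokenizer (assumption): lowercase, keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    # All contiguous n-token windows; empty if the text is shorter than n.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(docs: Iterable[str], n: int) -> Set[NGram]:
    # Union of n-grams over a (sampled) pretraining corpus.
    index: Set[NGram] = set()
    for doc in docs:
        index |= ngrams(tokenize(doc), n)
    return index


def overlap_fraction(item: str, corpus_index: Set[NGram], n: int) -> float:
    # Share of the benchmark item's n-grams that also occur in the corpus.
    item_grams = ngrams(tokenize(item), n)
    if not item_grams:
        return 0.0
    return len(item_grams & corpus_index) / len(item_grams)


def contains_canary(docs: Iterable[str], canary: str) -> bool:
    # True if the exact canary string appears in any corpus document.
    return any(canary in doc for doc in docs)


if __name__ == "__main__":
    corpus = [
        "Blog post: the answer to 'what is the capital of France and why' is Paris.",
        "Unrelated text about cooking pasta at home tonight.",
    ]
    index = build_corpus_index(corpus, n=5)
    for item in ["What is the capital of France and why?",
                 "Which river flows through the city of Vienna today?"]:
        print(f"{overlap_fraction(item, index, n=5):.2f}  {item}")
```

A high overlap fraction flags an item for manual review. A low one does not prove the item is clean, since paraphrases and discussion of the benchmark will not match verbatim.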
Journey Context:
Contamination is structural, not accidental: benchmarks are published, discussed in blogs and on leaderboards, scraped into Common Crawl, and then absorbed into the next pretraining mix. Even small overlaps can substantially inflate scores, particularly for small models. Decontamination via deduplication or n-gram filtering is insufficient on its own, because discussion of the benchmark can leak signal without reproducing the original text verbatim. The only robust defenses are private evals, delayed answer keys, dynamic generation, or capability-specific tasks that are hard to memorize.
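As one illustration of the dynamic-generation defense mentioned above, the sketch below builds a fresh item set from a seed for each evaluation run, so no fixed answer key exists to be scraped. The multiplication task is a toy stand-in for a real capability-specific task, and the names `EvalItem`, `generate_items`, and `score` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalItem:
    prompt: str
    answer: str


def generate_items(seed: int, count: int) -> List[EvalItem]:
    # A private, per-run seed makes the set reproducible for the evaluator
    # while keeping the concrete items out of any public corpus.
    rng = random.Random(seed)
    items = []
    for _ in range(count):
        a = rng.randint(100, 999)
        b = rng.randint(100, 999)
        items.append(EvalItem(prompt=f"What is {a} * {b}?", answer=str(a * b)))
    return items


def score(items: List[EvalItem], predictions: List[str]) -> float:
    # Exact-match accuracy after trimming surrounding whitespace.
    if not items:
        return 0.0
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred.strip() == item.answer)
    return correct / len(items)


if __name__ == "__main__":
    items = generate_items(seed=20260627, count=3)
    for item in items:
        print(item.prompt)
    # A perfect oracle scores 1.0; a real run would pass model outputs instead.
    print(score(items, [item.answer for item in items]))
```

Record the seed alongside the reported score so the run can be reproduced, but keep it private until the evaluation is retired.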
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:47:06.529300+00:00— report_created — created