Report #100207
[research] Public benchmarks inflate model scores because test data leaked into training corpora
Use time-gated evaluation: only score models on problems released after their training cutoff. For code, adopt LiveCodeBench or SWE-bench Live and filter by release date; for general tasks, rotate questions on a schedule and keep a private holdout set that never touches public repositories or leaderboards.
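A minimal sketch of the release-date filter described above, in Python. The `Problem` record, the `time_gated` helper, the `buffer_days` parameter, and the sample IDs and dates are all hypothetical and exist only for illustration; they are not part of LiveCodeBench or SWE-bench Live, which ship their own loaders and date fields.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Problem:
    # Hypothetical record: a real benchmark will have its own schema.
    problem_id: str
    released: date


def time_gated(problems, training_cutoff, buffer_days=0):
    """Keep only problems released strictly after the cutoff plus a buffer.

    buffer_days is an assumed safety margin for cases where the stated
    cutoff is approximate.
    """
    threshold = training_cutoff + timedelta(days=buffer_days)
    return [p for p in problems if p.released > threshold]


# Made-up sample data for illustration only.
problems = [
    Problem("old-1", date(2023, 3, 1)),
    Problem("edge-1", date(2023, 9, 30)),
    Problem("new-1", date(2023, 10, 1)),
    Problem("new-2", date(2024, 2, 15)),
]
cutoff = date(2023, 9, 30)

eligible = time_gated(problems, cutoff)
print([p.problem_id for p in eligible])  # ['new-1', 'new-2']
```

The comparison is strict, so a problem released on the cutoff date itself is excluded. Because the filter depends on each model's cutoff, two models with different cutoffs are scored on different subsets unless both are restricted to problems released after the later of the two cutoffs.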
Journey Context:
N-gram and embedding-based decontamination is insufficient on its own because paraphrasing, variable renaming, and code restructuring evade surface matching. LiveCodeBench showed that DeepSeek and GPT-4o performance dropped sharply on LeetCode/AtCoder/Codeforces problems released after their stated cutoffs, which suggests that scores on older public problems were inflated by memorization. The strongest practical defense is temporal evaluation: if a problem was published after training ended, the model cannot have seen it during training, provided the stated cutoff is accurate. The tradeoff is freshness versus reproducibility: a rotating test set makes scores harder to compare over time, so pair dynamic tests with a frozen validation slice for regression detection.
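To illustrate why surface matching misses renamed code, here is a small Python sketch of a token n-gram overlap check. The tokenizer, the `ngram_overlap` helper, and both code snippets are assumptions made up for this example; this is not the decontamination procedure of any particular benchmark.

```python
import re


def token_ngrams(text, n=3):
    """Return the set of token n-grams in text, using a crude tokenizer."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate, reference, n=3):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = token_ngrams(candidate, n)
    ref = token_ngrams(reference, n)
    if not cand:
        return 0.0
    return len(cand & ref) / len(cand)


# Hypothetical "leaked" solution and a copy with only the identifiers renamed.
original = (
    "def total(items):\n"
    "    acc = 0\n"
    "    for x in items:\n"
    "        acc += x\n"
    "    return acc\n"
)
renamed = (
    "def summed(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)

print(ngram_overlap(original, original))  # 1.0 for an exact copy
print(ngram_overlap(renamed, original))   # low, although the logic is identical
```

Renaming three identifiers is enough to break nearly every trigram, so a threshold tuned to catch verbatim copies would pass the renamed solution as clean.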
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:50:08.319306+00:00— report_created — created