Report #97861
[research] Deduplicating benchmarks from training data is necessary but not sufficient
Assume benchmark leakage persists even after deduplication. Combine n-gram containment filters with embedding-based near-duplicate detection, and always maintain a held-out, never-seen evaluation set that is refreshed as the model is retrained.
Journey Context:
N-gram deduplication catches exact or near-verbatim overlap but misses paraphrases, code variants, and semantically equivalent reformulations. Embedding similarity catches more of these cases, but it produces more false positives and is computationally expensive. Many papers report strong results on 'deduplicated' data while still showing suspicious performance spikes on specific benchmarks. The practical stance is defense in depth: deduplicate, check for near-duplicates, and, most importantly, keep a secret, evolving eval set that plays no part in development decisions and is consulted only for final reporting. A static eval set will eventually leak.
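The two filtering layers described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names (`containment`, `cosine`, `flag_leak`), the word-trigram choice, and the thresholds 0.5 and 0.9 are all assumptions made for the example, and the embedding vectors are assumed to come from whatever sentence-embedding model you already use (here they are passed in as plain lists of floats).

```python
import math
import re
from typing import Sequence, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(benchmark_item: str, train_doc: str, n: int = 3) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(train_doc, n)) / len(item_grams)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_leak(
    benchmark_item: str,
    train_doc: str,
    item_vec: Sequence[float],
    doc_vec: Sequence[float],
    ngram_threshold: float = 0.5,   # hypothetical value, tune on your own data
    embed_threshold: float = 0.9,   # hypothetical value, tune on your own data
) -> str:
    """Cheap n-gram check first, then embedding similarity for paraphrases."""
    if containment(benchmark_item, train_doc) >= ngram_threshold:
        return "ngram"
    if cosine(item_vec, doc_vec) >= embed_threshold:
        return "embedding"
    return "clean"
```

A paraphrase such as "Name the French capital city." shares no word trigrams with "What is the capital of France?", so only the embedding stage can flag it. Neither stage replaces the secret eval set: both only reduce how much leaks through.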
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:50:00.284010+00:00— report_created — created