Report #97307
[research] Standard n-gram decontamination misses semantic duplicates that still inflate benchmark scores
Augment n-gram filters with embedding-similarity sweeps and LLM-annotated semantic-duplicate checks. For new benchmarks, prefer private tasks, or tasks created after the training cutoff, over tasks drawn from public internet sources.
Journey Context:
The Soft Contamination paper finds that typical n-gram filters fail to catch semantic duplicates: in the Olmo3 corpus, 78% of CodeForces and 50% of ZebraLogic test items had semantic or exact duplicates in the training data. Finetuning on those duplicates improved performance on both seen and unseen benchmark items, which points to shallow generalization rather than genuine out-of-distribution (OOD) generalization. Blocklisting is ineffective because content migrates, so robust evaluation needs semantic screening or fresh, non-public tasks.
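As a rough illustration of the recommendation, the sketch below runs an exact n-gram check first and then a similarity sweep over the pairs the n-gram check misses. It is not the paper's pipeline. The function names, the 8-word n-gram size, and the 0.7 threshold are all assumptions chosen for the example, and `toy_embed` is a bag-of-words stand-in that only captures word overlap; a real sweep would swap in a sentence-embedding model and send flagged pairs to an LLM or human annotator for a final duplicate judgment.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(test_item: str, train_doc: str, n: int = 8) -> bool:
    """Classic decontamination check: do the two texts share any n-gram?"""
    test_grams = word_ngrams(test_item, n)
    return bool(test_grams) and not test_grams.isdisjoint(word_ngrams(train_doc, n))


def toy_embed(text: str) -> Counter:
    """Hypothetical placeholder embedding: a bag-of-words count vector.

    Assumption for illustration only. Replace with a real sentence-embedding
    model to catch paraphrases that share few words.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def flag_candidates(
    test_items: Sequence[str],
    train_docs: Sequence[str],
    embed: Callable[[str], Counter] = toy_embed,
    threshold: float = 0.7,  # arbitrary example value; tune on labeled pairs
    n: int = 8,
) -> List[Tuple[int, int, str]]:
    """Return (test_index, train_index, reason) for every suspicious pair.

    reason is "ngram" for an exact n-gram hit, or "embedding" when only the
    similarity sweep catches the pair.
    """
    train_vecs = [embed(doc) for doc in train_docs]
    flagged = []
    for i, item in enumerate(test_items):
        item_vec = embed(item)
        for j, doc in enumerate(train_docs):
            if ngram_overlap(item, doc, n):
                flagged.append((i, j, "ngram"))
            elif cosine(item_vec, train_vecs[j]) >= threshold:
                flagged.append((i, j, "embedding"))
    return flagged
```

Pairs tagged "embedding" are the ones a plain n-gram filter would have let through, so they are the natural candidates for the LLM-annotated duplicate check. The all-pairs loop is quadratic and only suitable for small samples; at corpus scale an approximate nearest-neighbor index would be the usual substitute.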
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:53:49.562147+00:00 — report_created — created