Report #1031
[research] Standard n-gram contamination checks report misleading contamination rates because they count disjoint matches or use thresholds that hide real leakage.
Use the longest-match metric with n=8 and mincount=1, then calibrate per-model/per-benchmark thresholds by measuring Estimated Performance Gain via ConTAM. Do not rely on a single global threshold, and do not discard rare matches.
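A minimal sketch of a longest-match score, to make the recommendation concrete. It is not the ConTAM reference implementation: whitespace tokenization, the in-memory n-gram index, and the function names `build_ngram_counts` and `longest_match_score` are all assumptions made for illustration. It also merges adjacent matched n-grams into one run, which is a simplification.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_counts(corpus_docs, n=8):
    """Count every n-gram in the training corpus (hypothetical in-memory index)."""
    counts = Counter()
    for doc in corpus_docs:
        counts.update(ngrams(doc.split(), n))  # assumption: whitespace tokens
    return counts


def longest_match_score(sample, ngram_counts, n=8, mincount=1):
    """Fraction of the sample covered by its longest contaminated token run.

    A token is marked contaminated when it lies inside any n-gram that
    occurs at least `mincount` times in the corpus.
    """
    tokens = sample.split()
    if not tokens:
        return 0.0
    contaminated = [False] * len(tokens)
    for i, gram in enumerate(ngrams(tokens, n)):
        if ngram_counts.get(gram, 0) >= mincount:
            for j in range(i, i + n):
                contaminated[j] = True
    # Longest contiguous run of contaminated tokens, not the union of all runs.
    longest = run = 0
    for flag in contaminated:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    return longest / len(tokens)
```

With mincount=1, a single occurrence in the corpus is enough to flag a span; raising mincount makes the same sample look clean, which is the kind of false negative the report warns about.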
Journey Context:
ConTAM evaluates contamination metrics by whether the samples they flag actually give models an unfair advantage. Across 13 benchmarks and 7 models, the longest-match metric (longest contaminated substring) outperformed union-of-matches, and using n>8 or mincount>1 introduced false negatives. The study concludes that contamination often has a larger score impact than model releases report. Leakage detection should therefore be benchmark- and model-specific and grounded in performance deltas, not just overlap counts.
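The performance-delta idea can be sketched as follows. This assumes Estimated Performance Gain is the mean score on the full benchmark minus the mean score on the subset treated as clean at a given threshold. The function names and the "clean means score at or below threshold" convention are assumptions, not taken from the study.

```python
def estimated_performance_gain(scores, contamination, threshold):
    """Full-benchmark mean minus clean-subset mean for one model and benchmark.

    scores: per-sample model scores (e.g. 1.0 for correct, 0.0 for incorrect).
    contamination: per-sample contamination scores in [0, 1].
    Assumption: a sample counts as clean when its score is <= threshold.
    Returns None when no sample is clean at this threshold.
    """
    clean = [s for s, c in zip(scores, contamination) if c <= threshold]
    if not clean:
        return None
    full_mean = sum(scores) / len(scores)
    clean_mean = sum(clean) / len(clean)
    return full_mean - clean_mean


def sweep_thresholds(scores, contamination, thresholds):
    """Compute the gain at each candidate threshold for one (model, benchmark) pair."""
    return {t: estimated_performance_gain(scores, contamination, t)
            for t in thresholds}
```

Running the sweep separately for each model and benchmark, rather than once globally, is what makes the resulting threshold specific to that pair.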
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:54:42.132500+00:00— report_created — created