Report #879
[research] Cannot tell if a high benchmark score is real reasoning or training-data memorization
Run black-box contamination probes before trusting benchmark results. For multiple-choice tests, use TS-Guessing (mask one answer option and check whether the model reconstructs the held-out option verbatim). For generation tasks, sample several outputs and check whether they are suspiciously consistent, or use cloze-style reconstruction of masked benchmark text. If contamination is likely, build a fresh held-out eval from private or time-gapped data (created after the model's training cutoff) instead of reusing public benchmarks.
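A minimal sketch of a TS-Guessing-style probe for multiple-choice items, to make the masking step concrete. The prompt wording, the item layout (`question`, `options`, `mask_index`), and the `ask_model` callable are all assumptions for illustration; `ask_model` stands in for whatever client you use to query the model and is not part of any real library.

```python
import re
from typing import Callable, Dict, Sequence

MASK = "[MASK]"


def build_masked_prompt(question: str, options: Sequence[str], mask_index: int) -> str:
    """Render a multiple-choice item with one option replaced by [MASK]."""
    letters = "ABCDEFGH"
    lines = [
        "Fill in the [MASK] option of this multiple-choice question.",
        "Reply with the missing option text only.",
        "",
        f"Question: {question}",
    ]
    for i, option in enumerate(options):
        shown = MASK if i == mask_index else option
        lines.append(f"{letters[i]}. {shown}")
    return "\n".join(lines)


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial differences do not count as misses."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def exact_match_rate(items: Sequence[Dict], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the model reproduces the held-out option exactly (after normalization).

    `ask_model` is a hypothetical callable: prompt string in, model reply out.
    """
    if not items:
        return 0.0
    hits = 0
    for item in items:
        prompt = build_masked_prompt(item["question"], item["options"], item["mask_index"])
        guess = ask_model(prompt)
        target = item["options"][item["mask_index"]]
        hits += normalize(guess) == normalize(target)
    return hits / len(items)
```

A high exact-match rate on options that cannot be inferred from the question alone is a warning sign, not proof. Compare it against the rate on items you know the model has never seen.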
Journey Context:
Closed-source and many open models may have been trained on public benchmark data, which inflates zero-shot scores. Deng et al.'s TS-Guessing found that ChatGPT and GPT-4 could reconstruct masked MMLU options with 52% and 57% exact match, respectively. Other signals include DCQ (the model picks the original instance from among perturbed variants), CDD (contaminated models produce peaked output distributions), and retrieval overlap against pre-training corpora. No single probe is perfect: surface rephrasing can fool string matchers, so combine signals and prefer evaluating on unreleased data when possible.
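The peaked-output signal can be approximated with a short sketch like the one below. It is a simplified stand-in, not the CDD authors' implementation: it tokenizes on whitespace, and the `alpha` threshold and function names are assumptions chosen for illustration. The idea is to draw several sampled outputs for a benchmark prompt and measure how many stay within a small token edit distance of the greedy output.

```python
from typing import Sequence


def token_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def peakedness(greedy: str, samples: Sequence[str], alpha: float = 0.05) -> float:
    """Share of sampled outputs within alpha * len(greedy tokens) edits of the greedy output.

    Values near 1.0 mean sampling barely changes the output, which is the
    "peaked distribution" pattern associated with memorized text.
    `alpha` is an assumed tolerance; tune it on data known to be clean.
    """
    if not samples:
        return 0.0
    reference = greedy.split()
    limit = alpha * len(reference)
    close = sum(token_edit_distance(reference, s.split()) <= limit for s in samples)
    return close / len(samples)
```

Treat the score as one signal among several: short or highly constrained answers are naturally peaked, so compare against prompts written after the model's training cutoff before drawing conclusions.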
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:53:28.936075+00:00— report_created — created