Report #3913
[research] How can I prove or detect that an LLM has seen my benchmark's test set?
Embed unique canary strings or backdoor signals in held-out test instances (e.g., DyePack-style multi-trigger backdoors), keep a fully private holdout split, and date-stamp tasks. Do not rely on n-gram overlap or embedding similarity alone, because they miss paraphrased content and finetuning-stage leakage.
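A minimal sketch of the canary and date-stamp bookkeeping described above. The function names, the `canary` and `created` fields, and the canary prefix are all hypothetical choices for illustration and do not come from any particular benchmark's tooling.

```python
import uuid
from datetime import date


def make_canary() -> str:
    # A random UUID is effectively unique, so seeing it in model output
    # is strong evidence the tagged file was in the training data.
    return f"BENCHMARK-CANARY-{uuid.uuid4()}"


def tag_instance(instance: dict, canary: str, created: date) -> dict:
    # Return a copy carrying the canary and an ISO-8601 creation date.
    tagged = dict(instance)
    tagged["canary"] = canary
    tagged["created"] = created.isoformat()
    return tagged


def leaked_canary(model_output: str, canary: str) -> bool:
    # Exact substring match only: this will not catch paraphrased leakage.
    return canary in model_output


def post_cutoff(instances: list[dict], cutoff: date) -> list[dict]:
    # Keep only tasks created strictly after the model's training cutoff.
    return [i for i in instances if date.fromisoformat(i["created"]) > cutoff]
```

A canary hit shows that the tagged data was seen. A miss proves nothing, which is why the private holdout and the date filter are kept as separate controls.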
Journey Context:
Standard provider decontamination uses n-gram overlap and embedding similarity, but these checks are imperfect, and outsiders cannot verify them without access to the training data. Membership inference often needs model internals, and exchangeability tests struggle with finetuning contamination. DyePack repurposes backdoor triggers as a 'dye pack': if a model has seen the test data, it is likely to activate the planted signal, giving a bounded false-positive rate (FPR) without requiring access to model weights or the training corpus. It is one of the few approaches that offers a formal FPR guarantee from model outputs alone. Complement it with private holdouts and temporal controls (problems released after the model's training cutoff), as LiveCodeBench does by date-stamping its problems; GPQA instead ships with a canary string and asks that its questions not be posted in plain text. Many benchmarks publish everything and then see scores inflate; secrecy and canaries are essential for trustworthy evaluation.
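The bounded false-positive rate can be illustrated with a simplified binomial model. This is a sketch in the spirit of DyePack, not the paper's exact procedure. It assumes each of `num_backdoors` triggers has a target drawn uniformly at random from `num_targets` options, and that an uncontaminated model matches each target independently with probability `1 / num_targets`. All names are hypothetical.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    # Probability that a clean model matches at least `threshold` of the
    # planted targets purely by chance: a binomial upper tail.
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


def min_threshold(num_backdoors: int, num_targets: int, max_fpr: float):
    # Smallest number of matched triggers that keeps the chance of
    # wrongly flagging a clean model at or below `max_fpr`.
    for t in range(num_backdoors + 1):
        if false_positive_rate(num_backdoors, num_targets, t) <= max_fpr:
            return t
    return None  # no threshold achieves the requested FPR
```

Under these assumptions, adding triggers or enlarging the target space drives the chance rate down quickly, which is the motivation for using several backdoors rather than one.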
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:30:23.105597+00:00— report_created — created