Report #3565
[research] Public benchmarks like MMLU, HumanEval, and SWE-bench may be in pretraining data, inflating scores
Audit contamination with Min-K% Prob: score a sample by the average log-probability of its k% least likely tokens. If that score is unusually high compared with text the model cannot have seen, flag the sample as likely seen during pretraining. Run the audit before launching a new benchmark or when comparing proprietary models that expose token log-probabilities.
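The scoring rule above can be sketched in a few lines. This is a minimal illustration, assuming the per-token log-probabilities have already been obtained from the model under audit; the function names, the default `k`, and the `threshold` value are hypothetical and not taken from the report. In practice the threshold would need calibrating on text known to be unseen.

```python
from typing import Sequence


def min_k_prob(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of least likely tokens.

    token_logprobs: one log-probability per token of the sample, as
    returned by the model under audit (assumed to be available).
    k: fraction of tokens to keep, e.g. 0.2 for the lowest 20%.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so very short samples still get a score.
    n_keep = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_keep]
    return sum(lowest) / n_keep


def flag_likely_seen(
    token_logprobs: Sequence[float],
    k: float = 0.2,
    threshold: float = -5.0,  # hypothetical value; calibrate per model
) -> bool:
    """Flag a sample whose Min-K% score is above the chosen threshold."""
    return min_k_prob(token_logprobs, k) > threshold
```

A higher (less negative) score means even the sample's least expected tokens were fairly predictable to the model, which is the signal the method treats as evidence of prior exposure.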
Journey Context:
LLMs memorize long verbatim sequences, so any public test set scraped from the web is suspect. Prior contamination-detection methods needed a reference model trained on similar data; Min-K% Prob needs only the target model's per-token log-probabilities, so it works through an API that returns them, with no reference model and no access to the pretraining corpus. Its authors report that it outperforms prior methods on the WikiMIA benchmark, and it has been applied to copyright detection, benchmark contamination, and unlearning audits. Caveat: a low Min-K% score does not guarantee that a sample is clean; combine it with n-gram overlap and dynamic canary tests for a stronger signal.
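As a complement to the probability-based score, the n-gram overlap check mentioned in the caveat can be sketched as follows. This is an illustrative sketch only: it assumes some candidate corpus text is available to compare against (for example a web snapshot), uses whitespace tokenization, and the function names and the default `n` are hypothetical choices rather than anything specified in the report.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(sample: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the sample's n-grams that also occur in corpus_text.

    Returns 0.0 when the sample is shorter than n words.
    """
    sample_grams = word_ngrams(sample, n)
    if not sample_grams:
        return 0.0
    shared = sample_grams & word_ngrams(corpus_text, n)
    return len(shared) / len(sample_grams)
```

A sample with high overlap but a low Min-K% score, or the reverse, is worth inspecting by hand, since the two signals fail in different ways.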
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:34:17.475129+00:00— report_created — created