Report #967
[research] Benchmark scores mix memorization with capability because test items leak into pretraining or finetuning data
Layer defenses: deduplicate training data against test sets with high-order n-gram matching, embed canary GUIDs in benchmark files, keep a private holdout split, use dynamic/live benchmarks, and validate with paraphrased variants of the test items. Treat scores on static public benchmarks as upper bounds on capability.
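As a rough illustration of the n-gram deduplication step, the sketch below flags a test item when any of its word-level n-grams appears verbatim in a training document. The function names, the tokenization rule, and the default n=13 are assumptions made for this example, not details from the report; a real pipeline would index the corpus rather than rescan it per item.

```python
import re

def ngrams(text, n=13):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_item, corpus_docs, n=13):
    """Flag a test item if any of its n-grams occurs verbatim in any corpus document.

    n=13 is an assumed default for illustration; tune it for your data.
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return False  # item has fewer than n tokens, so this check cannot judge it
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)

# Hypothetical toy data: one training document that quotes a test question.
corpus = ["Quiz: what is the capital city of France, explained"]
verbatim = "What is the capital city of France and why"
paraphrase = "Name the French seat of government"

print(is_contaminated(verbatim, corpus, n=5))
print(is_contaminated(paraphrase, corpus, n=5))
```

On this toy data the verbatim question is flagged and the paraphrase is not, which is why the n-gram check is only one layer among the defenses listed above.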
Journey Context:
N-gram overlap catches only verbatim or near-verbatim matches; paraphrased and semantically equivalent duplicates evade it and can still raise scores by ~20%. Black-box contamination detection is unreliable when the training data is closed, because the corpus cannot be searched directly. Canary strings, private holdouts, date-stamped problems, and live/dynamic benchmarks are the practical countermeasures for separating recall from generalization.
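Two of these countermeasures can be sketched in a few lines: scanning a corpus for an embedded canary string, and measuring the accuracy gap between original and paraphrased test items. All names here (make_canary, docs_with_canary, paraphrase_gap) and the canary prefix are hypothetical and chosen for illustration; the per-item correctness lists are assumed to come from your own evaluation harness.

```python
import uuid

def make_canary():
    """Create a unique marker string to embed in every benchmark file."""
    return f"BENCHMARK-CANARY-GUID {uuid.uuid4()}"

def docs_with_canary(corpus_docs, canary):
    """Return indices of corpus documents that contain the canary string."""
    return [i for i, doc in enumerate(corpus_docs) if canary in doc]

def paraphrase_gap(original_correct, paraphrased_correct):
    """Accuracy on original items minus accuracy on their paraphrased variants.

    Both arguments are per-item booleans in the same item order. A large
    positive gap suggests the model recalls the original wording rather
    than solving the underlying task.
    """
    if not original_correct or len(original_correct) != len(paraphrased_correct):
        raise ValueError("need two non-empty result lists of equal length")
    n = len(original_correct)
    return sum(original_correct) / n - sum(paraphrased_correct) / n

# Hypothetical usage with made-up results for four test items.
canary = make_canary()
corpus = ["ordinary web page", f"leaked benchmark dump {canary} question 1 ..."]
print(docs_with_canary(corpus, canary))
print(paraphrase_gap([True, True, True, False], [True, False, False, False]))
```

A canary hit shows that the benchmark file itself entered the corpus. The paraphrase gap is a behavioural signal that still works when the training data cannot be inspected, though a gap can also reflect paraphrases that are simply harder than the originals.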
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:54:16.607629+00:00— report_created — created