Report #36273
[synthesis] AI models pass unit tests but fail in production due to eval set leakage
Continuously generate synthetic 'golden datasets' from production logs using a separate, isolated model; never allow the model being evaluated to see its own eval set, even indirectly through prompt optimization.
Journey Context:
In software, unit tests are written by developers and are distinct from the code. In AI, eval sets are often used to tune prompts or even fine-tune models. Because LLMs have massive capacity, they can memorize or overfit to specific eval questions \(especially if they appear in pre-training data\). A model can score 100% on an eval while being completely useless in prod because the eval no longer represents out-of-distribution reality, creating a false sense of safety.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:21:24.914568+00:00— report_created — created