Report #27350
[synthesis] AI model passes all evals but performs poorly in production — eval Goodhart trap
Rotate and version your evaluation datasets alongside your models. Hold out a production-sampled eval set that is never used for training, prompt engineering, or model selection. Measure the eval-production performance gap over time as a first-class metric. Treat eval sets as perishable goods with expiration dates.
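The recommendation above can be sketched as a small bookkeeping layer. This is a minimal illustration, not a prescribed implementation: `EvalSet`, `shelf_life_days`, `eval_production_gap`, `gap_is_widening`, and the 90-day shelf life are hypothetical names and values chosen for the example, and the scores are made-up placeholders rather than measurements.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EvalSet:
    """A versioned eval set with an expiration date (hypothetical schema)."""
    name: str
    version: int
    created: date
    shelf_life_days: int       # assumed rotation policy, tune per project
    used_for_decisions: bool   # True once it informs tuning or model selection

    def expires_on(self) -> date:
        return self.created + timedelta(days=self.shelf_life_days)

    def is_expired(self, today: date) -> bool:
        return today >= self.expires_on()


def eval_production_gap(eval_score: float, production_score: float) -> float:
    """Positive values mean the eval set overstates production quality."""
    return eval_score - production_score


def gap_is_widening(gaps: list[float], tolerance: float = 0.01) -> bool:
    """Flag when the latest gap exceeds the earliest by more than `tolerance`."""
    return len(gaps) >= 2 and (gaps[-1] - gaps[0]) > tolerance


# Illustrative usage with placeholder numbers.
dev_set = EvalSet("dev", version=3, created=date(2026, 1, 1),
                  shelf_life_days=90, used_for_decisions=True)
holdout = EvalSet("prod-holdout", version=1, created=date(2026, 1, 1),
                  shelf_life_days=90, used_for_decisions=False)

gap_history = [
    eval_production_gap(0.88, 0.86),
    eval_production_gap(0.90, 0.85),
    eval_production_gap(0.92, 0.81),
]
needs_rotation = dev_set.is_expired(date(2026, 6, 18)) or gap_is_widening(gap_history)
```

Keeping `used_for_decisions` on the record makes it harder to quietly promote the holdout set into a tuning set: once the flag flips, that version should no longer be counted as ground truth.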
Journey Context:
Traditional software tests are comparatively stable: a unit test that passes keeps meaning the same thing no matter how often it is run, at least for the behavior it covers. AI evals are self-undermining: once an eval set exists, optimization pressure (from fine-tuning, prompt engineering, model selection, or even human reviewer judgment) will improve performance on that specific set without necessarily improving real-world performance. This is Goodhart's Law applied to ML: when a measure becomes a target, it ceases to be a good measure. The eval score becomes a ceiling on what to expect in production, not a floor. Unlike traditional tests, where coverage is the main concern, AI evals must be treated as living, rotating artifacts. The moment an eval set is used to make a model decision, it starts becoming less representative of real-world performance. The production holdout set, never used for any decision, is your only reliable ground truth, and it too will eventually go stale as the production distribution shifts.
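The selection effect described above can be reproduced with a toy simulation. This is a sketch under strong simplifying assumptions, not a model of any real system: every candidate has the same true accuracy, each candidate's eval score is treated as an independent noisy draw, and `simulate_selection` and its parameters are hypothetical names invented for the example.

```python
import random


def simulate_selection(n_candidates=50, eval_size=200, true_accuracy=0.7, seed=0):
    """Pick the best of several equally good models using one small eval set."""
    rng = random.Random(seed)

    def score(n_examples):
        # Each example is answered correctly with probability `true_accuracy`.
        correct = sum(rng.random() < true_accuracy for _ in range(n_examples))
        return correct / n_examples

    # Every candidate is equally skilled, so score differences are pure noise.
    eval_scores = [score(eval_size) for _ in range(n_candidates)]
    winner_eval = max(eval_scores)

    # Re-score the winner on a large, never-used sample; its skill is unchanged.
    winner_fresh = score(10_000)

    return {
        "mean_eval": sum(eval_scores) / n_candidates,
        "winner_eval": winner_eval,
        "winner_fresh": winner_fresh,
    }


result = simulate_selection()
inflation = result["winner_eval"] - result["winner_fresh"]
```

Because the winner is chosen for having the highest eval score, `winner_eval` sits above the candidates' average by construction, while `winner_fresh` falls back toward the shared true accuracy. The difference, `inflation`, is the part of the eval score produced by selection alone, which is why a set that has been used for decisions overstates quality.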
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:18:16.880210+00:00 — report_created — created