Report #58432
[synthesis] Why do AI products pass all evals in staging but fail in production with no code changes?
Build living evaluation sets that continuously sample from production traffic, have humans label a representative subset, and refresh the eval set monthly. Weight recent samples more heavily than old ones. Track eval performance by user cohort and use case, not just aggregate. Treat eval sets as versioned, evolving artifacts with their own changelog and deprecation schedule.
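The recency weighting and per-cohort tracking described above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `EvalSample` fields, the cohort labels, and the exponential decay with a 30-day half-life are all assumptions chosen to match a monthly refresh cadence.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class EvalSample:
    cohort: str        # hypothetical cohort or use-case label
    sampled_on: date   # when the sample was drawn from production traffic
    passed: bool       # whether the output met the human-labeled bar


def recency_weight(sampled_on: date, today: date, half_life_days: float = 30.0) -> float:
    """Exponential decay: a sample loses half its weight every half_life_days."""
    age_days = max((today - sampled_on).days, 0)
    return 0.5 ** (age_days / half_life_days)


def weighted_pass_rate_by_cohort(samples, today, half_life_days=30.0):
    """Return {cohort: recency-weighted pass rate} instead of one aggregate score."""
    passed_weight = defaultdict(float)
    total_weight = defaultdict(float)
    for s in samples:
        w = recency_weight(s.sampled_on, today, half_life_days)
        total_weight[s.cohort] += w
        if s.passed:
            passed_weight[s.cohort] += w
    return {c: passed_weight[c] / total_weight[c] for c in total_weight}


# Usage with made-up samples: the older failure counts for half as much.
today = date(2026, 6, 20)
samples = [
    EvalSample("cohort_a", date(2026, 6, 20), True),
    EvalSample("cohort_a", date(2026, 5, 21), False),
    EvalSample("cohort_b", date(2026, 6, 20), True),
]
rates = weighted_pass_rate_by_cohort(samples, today)
```

Reporting the result per cohort keeps a regression in one use case from being averaged away by the others.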
Journey Context:
Traditional software has a fixed specification: if the tests pass, the software works. AI products have a shifting specification, because what counts as 'good' changes as users discover new use cases, as the input distribution shifts, and as the competitive landscape redefines expectations. Combining specification engineering with ML evaluation methodology shows that static eval sets go stale within weeks as the production distribution drifts away from the eval distribution. The drift is invisible because the eval set itself does not change. A model can score 95% on a stale eval set while its production quality degrades significantly, because users are now asking questions the eval set does not cover. Teams commonly build an eval set once, validate it, and run it indefinitely. The better approach is to treat eval sets as living documents that are continuously refreshed from production, with explicit versioning and deprecation of stale samples.
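One way to make this drift visible is to compare the mix of use cases in the eval set against a recent production sample. The sketch below assumes every request can be tagged with a use-case label (the labels shown are hypothetical) and uses total variation distance as the drift measure. That metric is an illustrative choice here, not something the report specifies.

```python
from collections import Counter


def category_shares(labels):
    """Convert a list of use-case labels into {label: fraction of total}."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(eval_labels, prod_labels):
    """0.0 means identical label mixes; 1.0 means no overlap at all."""
    p = category_shares(eval_labels)
    q = category_shares(prod_labels)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in all_labels)


def uncovered_categories(eval_labels, prod_labels):
    """Use cases seen in production that the eval set never exercises."""
    return sorted(set(prod_labels) - set(eval_labels))


# Usage with made-up labels: production has moved from "search" to "export".
eval_labels = ["billing"] * 5 + ["search"] * 5
prod_labels = ["billing"] * 5 + ["export"] * 5
drift = total_variation_distance(eval_labels, prod_labels)
missing = uncovered_categories(eval_labels, prod_labels)
```

A rising distance or a non-empty list of uncovered categories is a signal to refresh the eval set, even if the eval score itself has not moved.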
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:34:03.449479+00:00— report_created — created