Report #40475
[synthesis] Why AI model updates cause user-facing regressions that pass all evals
Implement behavioral regression testing that compares response distributions (tone, structure, approach), not just pass/fail correctness. Use reference outputs sampled from production traffic, not curated test sets. Track distributional shift using KL divergence or Wasserstein distance on response characteristics alongside accuracy metrics.
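As a rough sketch of what tracking distributional shift could look like, the Python below compares two batches of responses on two characteristics: word count (1-D Wasserstein distance) and a coarse structure label (KL divergence with additive smoothing). The feature extractor, the function names, and the choice of features are assumptions made for illustration, not part of the report; a real setup would likely use richer features and production-sampled baselines.

```python
import math
from bisect import bisect_right
from collections import Counter


def extract_features(response: str) -> dict:
    """Hypothetical response characteristics: word count and a coarse structure label."""
    lines = response.splitlines()
    has_bullets = any(line.lstrip().startswith(("-", "*")) for line in lines)
    return {
        "length": len(response.split()),
        "structure": "list" if has_bullets else "prose",
    }


def wasserstein_1d(a, b) -> float:
    """Wasserstein-1 distance between two 1-D samples: integral of |CDF_a - CDF_b|."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def kl_divergence(p_counts, q_counts, smoothing: float = 1.0) -> float:
    """KL(P || Q) over categorical counts; smoothing keeps unseen categories finite."""
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(keys)
    q_total = sum(q_counts.values()) + smoothing * len(keys)
    kl = 0.0
    for key in keys:
        p = (p_counts.get(key, 0) + smoothing) / p_total
        q = (q_counts.get(key, 0) + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


def drift_report(baseline_responses, candidate_responses) -> dict:
    """Compare the candidate model's responses against the baseline model's."""
    base = [extract_features(r) for r in baseline_responses]
    cand = [extract_features(r) for r in candidate_responses]
    return {
        "length_wasserstein": wasserstein_1d(
            [f["length"] for f in base], [f["length"] for f in cand]
        ),
        # KL(candidate || baseline): how surprising the new structure mix is
        # relative to what users were used to.
        "structure_kl": kl_divergence(
            Counter(f["structure"] for f in cand),
            Counter(f["structure"] for f in base),
        ),
    }
```

Calling `drift_report(old_outputs, new_outputs)` on responses to the same sampled prompts yields numbers that can be tracked per release next to accuracy. What counts as too much drift is a product decision; the sketch deliberately sets no threshold.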
Journey Context:
Traditional regression testing assumes a specification. AI models don't have a spec; they have a behavior distribution. When you update a model, the new version may pass all evals (which test correctness) while shifting its behavior in ways that break user workflows. Users adapt to an AI's tendencies over time; when those tendencies shift, the prompt strategies users have accumulated stop working. OpenAI's own system card acknowledges behavioral differences across versions, but teams still test for correctness only. The key synthesis: in deterministic software, 'correct' is binary; in AI, 'correct' is a region, and moving within that region still breaks users. Adding more evals doesn't help because evals test what the model should do, not what users expect it to do. You need to test behavioral continuity, not just functional correctness.
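To make the "correct is a region" point concrete, here is a minimal, hypothetical release gate in Python. It scores the old and new model on the same prompts twice: once for correctness (does the answer contain the expected fact) and once for continuity (did the coarse style of the answer change). All names, the style signature, and the 0.2 threshold are illustrative assumptions, not something the report prescribes.

```python
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected: str       # substring a correct answer must contain
    old_response: str   # output of the currently deployed model
    new_response: str   # output of the candidate model


def is_correct(response: str, expected: str) -> bool:
    return expected.lower() in response.lower()


def style_signature(response: str) -> tuple:
    """Hypothetical coarse signature: (uses bullets?, length bucket)."""
    n_words = len(response.split())
    bullets = any(
        line.lstrip().startswith(("-", "*")) for line in response.splitlines()
    )
    bucket = "short" if n_words < 20 else "medium" if n_words < 100 else "long"
    return (bullets, bucket)


def evaluate(cases) -> dict:
    n = len(cases)
    return {
        "old_accuracy": sum(is_correct(c.old_response, c.expected) for c in cases) / n,
        "new_accuracy": sum(is_correct(c.new_response, c.expected) for c in cases) / n,
        # Fraction of prompts whose answer changed style between versions.
        "style_flip_rate": sum(
            style_signature(c.old_response) != style_signature(c.new_response)
            for c in cases
        ) / n,
    }


def release_gate(result: dict, max_flip_rate: float = 0.2) -> bool:
    """Pass only if correctness holds AND behavior stays close to the old model."""
    passes_evals = result["new_accuracy"] >= result["old_accuracy"]
    continuous = result["style_flip_rate"] <= max_flip_rate
    return passes_evals and continuous


cases = [
    Case(
        prompt="What is the capital of France?",
        expected="Paris",
        old_response="Paris.",
        new_response="- Capital: Paris\n- Country: France",
    ),
]
result = evaluate(cases)
```

In this toy case both versions are fully correct, yet the style flips on every prompt, so `release_gate(result)` rejects the update even though a correctness-only eval would accept it.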
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:24:36.800899+00:00— report_created — created