Report #79233
[synthesis] Why AI features that pass all test cases in staging fail unpredictably in production
Test against production-realistic input distributions, not curated examples; version prompts with the same rigor as code and require prompt-change reviews; maintain a living regression test suite of edge cases mined from production failures; treat prompt engineering as specification under uncertainty, not configuration.
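One way to picture "version prompts with the same rigor as code" is to derive a version ID from the prompt text and refuse to ship a changed prompt that nobody has reviewed. The sketch below is illustrative only: `PromptVersion`, `reviewed_by`, and `can_deploy` are hypothetical names, not part of any specific tool, and a real team would more likely enforce this through its normal code-review workflow.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical record for a prompt tracked like source code."""
    name: str
    template: str
    reviewed_by: Optional[str] = None  # assumption: set once a reviewer signs off

    @property
    def version_id(self) -> str:
        # Content hash: any edit to the template yields a new version ID.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]


def can_deploy(candidate: PromptVersion, deployed: PromptVersion) -> bool:
    """Allow unchanged prompts; require a review for any changed prompt."""
    if candidate.version_id == deployed.version_id:
        return True
    return candidate.reviewed_by is not None


deployed = PromptVersion("triage", "Classify: {query}", reviewed_by="alice")
unreviewed = PromptVersion("triage", "Classify the request: {query}")
reviewed = PromptVersion("triage", "Classify the request: {query}", reviewed_by="bob")

print(can_deploy(unreviewed, deployed))  # changed and unreviewed: blocked
print(can_deploy(reviewed, deployed))    # changed and reviewed: allowed
```

Hashing the template rather than hand-numbering versions means a silent one-word edit cannot reuse an old version ID.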
Journey Context:
Traditional software is tested against specifications: given input X, the code should produce output Y. AI features are tested against tendencies: given prompt P, the model tends to produce reasonable outputs. But "tends to" depends on the input distribution. A prompt that works on the test distribution (curated, well-formed queries) may fail on the production distribution (messy, ambiguous, adversarial inputs). The deployment gap is not a bug; it is a fundamental property of non-deterministic systems. Every prompt is an implicit specification that holds only for the distribution it was tested on, and production distributions are almost always wider. Teams that treat prompts as configuration rather than as specification under uncertainty are blindsided when production inputs diverge from test inputs. The solution has three parts: prompt versioning, testing against the production distribution, and a living regression suite that grows with every production failure.
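The gap between the test and production distributions can be made visible by tagging each regression case with its origin and reporting pass rates per origin. The following is a minimal sketch under stated assumptions: `Case`, `RegressionSuite`, and `toy_feature` are hypothetical names, and `toy_feature` is a deliberately brittle stand-in for a real prompt-plus-model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Case:
    query: str
    check: Callable[[str], bool]  # property the output must satisfy
    source: str                   # "curated" or "production"


@dataclass
class RegressionSuite:
    cases: List[Case] = field(default_factory=list)

    def add_production_failure(self, query: str, check: Callable[[str], bool]) -> None:
        # Every failure seen in production becomes a permanent test case.
        self.cases.append(Case(query, check, "production"))

    def pass_rates(self, feature: Callable[[str], str]) -> Dict[str, float]:
        # Report pass rate per source so a curated-only success cannot hide
        # a failure on production-shaped inputs.
        totals: Dict[str, int] = {}
        passed: Dict[str, int] = {}
        for case in self.cases:
            totals[case.source] = totals.get(case.source, 0) + 1
            ok = case.check(feature(case.query))
            passed[case.source] = passed.get(case.source, 0) + int(ok)
        return {source: passed[source] / totals[source] for source in totals}


def toy_feature(query: str) -> str:
    # Hypothetical stand-in for the AI feature; case-sensitive on purpose.
    return "refund" if "refund" in query else "unknown"


def wants_refund(output: str) -> bool:
    return output == "refund"


suite = RegressionSuite()
suite.cases.append(Case("How do I get a refund?", wants_refund, "curated"))
suite.cases.append(Case("Is a refund possible?", wants_refund, "curated"))
suite.add_production_failure("i want my money back!!", wants_refund)
suite.add_production_failure("REFUND NOW", wants_refund)

rates = suite.pass_rates(toy_feature)
print(rates)
```

In this toy setup the curated cases all pass while both production-mined cases fail, which is the staging-versus-production pattern the report describes. With a real model, each case would typically be run several times and judged against a pass-rate threshold rather than a single exact match.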
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:35:14.344197+00:00 - report_created - created