Report #91666
[synthesis] Why did my AI feature slowly stop working without triggering any alerts?
Implement semantic canaries: automated checks that run representative queries through the production model on a schedule and compare the outputs against expected results, using either an LLM-as-judge or embedding similarity. Alert on the canary pass rate, not just on serving metrics such as latency and error rate. Also track output distribution statistics as leading indicators: when mean response length, top-k entity frequency, or the sentiment score distribution shifts, the model's behavior has changed even if no individual output is wrong.
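A minimal sketch of a canary run, under stated assumptions: the `model` callable, the canary pairs, and both thresholds (`min_similarity`, `threshold`) are hypothetical, and a bag-of-words cosine similarity stands in for the embedding similarity or LLM-as-judge comparison a real setup would use.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def _bag_of_words(text: str) -> Counter:
    # Lowercased token counts; a crude stand-in for a real embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two strings."""
    va, vb = _bag_of_words(a), _bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def canary_pass_rate(
    model: Callable[[str], str],
    canaries: List[Tuple[str, str]],
    min_similarity: float = 0.6,  # assumed per-canary pass threshold
) -> float:
    """Run every (query, expected) pair through the model; return the fraction that pass."""
    if not canaries:
        raise ValueError("at least one canary is required")
    passed = sum(
        1
        for query, expected in canaries
        if cosine_similarity(model(query), expected) >= min_similarity
    )
    return passed / len(canaries)


def should_alert(pass_rate: float, threshold: float = 0.9) -> bool:
    """Alert when the canary pass rate falls below the assumed threshold."""
    return pass_rate < threshold
```

Running `canary_pass_rate` on a schedule and feeding the result to `should_alert` gives a signal that is independent of serving metrics: a model that answers every request with a well-formed but irrelevant response keeps its error rate at zero while its pass rate drops.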
Journey Context:
Conventional software failures tend to be binary: a crash either happens or it does not. An AI feature degrades gradually instead, as the input distribution drifts away from the training data, as the model meets edge cases it was not trained on, or as the world changes with new terminology and events. Standard monitoring cannot see this degradation because each individual output is still valid; there is no error code for "slightly less relevant than last month". The synthesis: AI products need the equivalent of canary deployments in reverse, testing not whether new code works but whether the existing model still works on current inputs. This differs fundamentally from software monitoring because the code has not changed; the world has. Alert thresholds on error rates will never fire, because the model is not erroring. It is just becoming progressively less correct.
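Because no single output raises an error, gradual drift has to be detected on aggregates. One possible sketch, assuming response length is the tracked statistic: compare the mean of a recent window against a baseline window with a z-score. The function names, the window contents, and the `z_threshold` of 3.0 are illustrative assumptions, not values from this report.

```python
import math
import statistics
from typing import Sequence


def mean_shift_z(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Z-score of the current window's mean against the baseline distribution."""
    if len(baseline) < 2 or not current:
        raise ValueError("need at least 2 baseline values and 1 current value")
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)  # sample standard deviation
    diff = statistics.fmean(current) - base_mean
    if base_std == 0:
        # Constant baseline: any difference at all is an unbounded shift.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    standard_error = base_std / math.sqrt(len(current))
    return diff / standard_error


def has_drifted(
    baseline: Sequence[float],
    current: Sequence[float],
    z_threshold: float = 3.0,  # assumed alerting threshold
) -> bool:
    """Flag drift when the mean has moved more than z_threshold standard errors."""
    return abs(mean_shift_z(baseline, current)) > z_threshold
```

The same comparison could be applied to any scalar output statistic, such as a per-response sentiment score. A z-score on the mean only catches shifts in central tendency; a change in the shape of the distribution would need a different test.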
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:27:07.561505+00:00— report_created — created