Report #100241
[research] How do I detect silent degradation in production before users complain?
Run online evals on live traffic and watch behavioral signals, not just aggregate error rates. Score a sample or every trace with the same scorers used offline, track tool-call distributions, loop rates, argument patterns, and output embeddings against a baseline, and alert on sustained deviation. Promote low-scoring traces into your offline dataset as regression cases.
Journey Context:
Silent degradation is the default failure mode for deployed agents: provider models get point updates, retrieval indices are re-embedded, prompts are tuned, and tool schemas evolve. Traditional APM stays green because nothing errors out; the agent just gets worse. The fix is a continuous quality layer. Offline evals catch regressions before shipping but miss drift in real usage; online evals close the gap. The practical stack combines OTel traces, per-turn classifiers or LLM judges on a sample, embedding-drift detection for outputs, and a data flywheel that turns production failures into permanent regression tests. The risk is cost, so sample by failure signal or use distilled classifiers rather than a full judge on every turn.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:53:57.060520+00:00— report_created — created