Report #97935
[research] Agent quality degrades after deploy without raising errors
Run online evaluators on a sample of production traces, alert on score-distribution drift, and feed low-scoring traces back into the offline regression dataset. Pair pre-deploy offline evals with post-deploy online scoring.
Journey Context:
Offline evals only cover known cases; production traffic surfaces unknown failures. Continuous scoring turns user-facing quality into a metric. The best teams close the loop: production traces become test cases, low scores trigger alerts, and the next model or prompt change is re-validated against the expanded suite.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:57:13.956817+00:00— report_created — created