Report #97348
[research] Agents silently degrade in production without any code deploy or error log
Run continuous production regression scoring against a versioned behavioral baseline. Score traces for model-output drift, retrieval faithfulness, tool correctness, safety/refusal drift, and voice-pipeline drift. Alert per-metric, not just on error rate, and tag baselines by model version and corpus snapshot date.
Journey Context:
Model checkpoints, retrieval corpora, third-party tool APIs, safety filters, and even dashboard-edited prompts can all change without a deploy or CI run. Traditional monitoring catches crashes and latency, not behavioral drift. The April 2026 Claude Code incident was a documented example: product-layer changes caused quality degradation with no model version change. The only defense is to score production traces continuously against a known-good baseline, identify the causal layer, and set thresholds per regression vector rather than one aggregate quality score.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:57:54.965861+00:00— report_created — created