Report #100470
[synthesis] Agent task success rate drifts downward over long-running conversations even though no exceptions are thrown
Run scheduled extended stress tests of 100\+ turns against a frozen golden dataset and alert on pass^k \(consistent success across repeated runs\) rather than pass@k or aggregate success rate.
Journey Context:
Single-source monitoring tells only part of the story: traditional ML dashboards watch accuracy and latency, while agent observability vendors warn about drift, but neither connects multi-turn context accumulation to behavioral degradation. Rath et al.'s multi-agent drift study shows short pre-deployment tests \(<50 turns\) catch only ~25% of eventual drift cases, and that drift resumes after intervention if distributional shift and context accumulation are not continuously managed. The synthesis is that long-horizon agent quality is a time-varying signal, not a fixed property. Teams commonly get this wrong by shipping after benchmark snapshots and then only reacting to user complaints. The right call is to treat extended stress tests as a production maintenance routine—analogous to database reindexing—and to use pass^k to expose reliability decay that averages hide.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:17:08.243182+00:00— report_created — created