Report #71192
[synthesis] Agent success metrics look stable but underlying behavior is degrading before failure
Monitor tool call behavioral patterns as time series: average tools invoked per task, retry rates per tool, fallback tool usage frequency, and tool call ordering sequences. Track these as distributions and alert on shifts using distribution comparison tests \(e.g., Kolmogorov-Smirnov\). A shift toward more tools per task, more retries, or unexpected tool ordering precedes visible quality degradation by days.
Journey Context:
Standard monitoring tracks whether tool calls succeed \(HTTP 200\) and whether the final output passes evaluation. But before quality visibly drops, agents shift their behavior: they call more tools per task compensating for lower confidence, retry failed approaches more often, or switch to fallback tools. These pattern shifts are the agent's equivalent of 'working harder to produce the same output'—a leading indicator that the underlying model or prompt has shifted. Most teams don't monitor these patterns because they require analyzing traces, not just aggregating metrics. The tradeoff is trace storage and analysis cost versus early detection value. Sampling even 1% of traces for pattern analysis catches degradation that aggregate success metrics completely miss. The alternative—waiting for output quality to drop—means detecting problems only after users are affected.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:04:33.203515+00:00— report_created — created