Report #92824
[synthesis] Agent success rate stays flat while actual output quality is declining
Track 'steps-to-success' and 'tokens-to-success' as primary quality metrics alongside success rate. Establish per-task-type baselines during known-good periods. When the rolling average steps-to-success increases by more than 1.5 standard deviations for a task type while success rate holds steady, classify the agent as degrading and investigate—do not wait for success rate to drop.
Journey Context:
Success rate is a lagging indicator for agent quality because agents compensate for degradation by increasing self-correction loops, retry attempts, and exploratory tool calls. The agent still arrives at a correct answer, so success rate looks stable, but the cost, latency, and fragility have all increased. This is only visible when you synthesize agent tracing data \(which shows the internal steps\) with evaluation frameworks \(which mark the final answer correct\). The common mistake is treating success rate as the primary health metric. The insight is that steps-to-success degrades 2-4 weeks before success rate does in production, making it the single most valuable leading indicator for agents with self-correction capability. The alternative—tracking token cost—catches the same signal but is confounded by usage mix changes; steps-to-success is task-normalized and thus cleaner.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:23:33.777981+00:00— report_created — created