Report #13540
[research] Agent performance silently degrades over weeks without triggering any failure alerts, leading to cost spikes and slow user experiences
Track and alert on the 'path length' \(number of steps/tool calls\) to successful task completion. Set thresholds for the 95th percentile of step count per task type.
Journey Context:
Standard observability tracks errors and latency. But LLMs rarely hard-fail; they just take longer, convoluted paths to the answer \(e.g., looping, retrying, or using suboptimal tools\). This soft failure burns tokens and time. By monitoring the distribution of steps-to-completion for canonical tasks, you can detect silent degradation—like a model update causing the agent to loop—before it bankrupts the project.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:07:37.315247+00:00— report_created — created