Report #1380
[research] Silent degradation in agent performance where tasks still succeed but cost doubles due to invisible retry loops
Monitor the statistical distribution of step counts and token usage per task type, and alert on shifts in the mean or variance (e.g., using Exponential Moving Average or CUSUM charts), rather than tracking only binary success/failure rates.
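A minimal sketch of the moving-average variant, assuming a baseline mean and standard deviation of step counts have already been estimated for one task type during a known-good period. The class name `EwmaMonitor`, its parameters, and the default smoothing factor and limit width are illustrative assumptions, not part of any particular monitoring library. The control limit uses the standard EWMA chart variance formula.

```python
import math


class EwmaMonitor:
    """EWMA control chart for one metric of one task type (illustrative sketch)."""

    def __init__(self, baseline_mean, baseline_std, alpha=0.2, width=3.0):
        self.mu = baseline_mean      # assumed known-good mean (e.g. steps per task)
        self.sigma = baseline_std    # assumed known-good standard deviation
        self.alpha = alpha           # smoothing factor; smaller reacts more slowly
        self.width = width           # control limit width in standard deviations
        self.ewma = baseline_mean
        self.n = 0

    def update(self, value):
        """Feed one observation; return True if the EWMA is outside its limits."""
        self.n += 1
        self.ewma = self.alpha * value + (1 - self.alpha) * self.ewma
        # Variance of the EWMA statistic after n observations, relative to sigma^2.
        factor = (self.alpha / (2 - self.alpha)) * (1 - (1 - self.alpha) ** (2 * self.n))
        limit = self.width * self.sigma * math.sqrt(factor)
        return abs(self.ewma - self.mu) > limit


# Usage with made-up step counts: a stable stretch, then a shift upward.
monitor = EwmaMonitor(baseline_mean=4.0, baseline_std=1.0)
for steps in [4, 5, 3, 4, 4, 5, 3, 4, 7, 7, 7, 7]:
    if monitor.update(steps):
        print(f"drift alert: EWMA of steps is {monitor.ewma:.2f}")
```

One monitor per (task type, metric) pair keeps a shift in one task type from being averaged away by the others. The same class could be fed token counts instead of step counts.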
Journey Context:
LLMs often degrade softly: a model update might make the model slightly worse at formatting tool calls, so it retries 3 times before succeeding. The task success metric stays at 100%, hiding the roughly 3x cost increase and the latency spike. Simple threshold alerts on total cost are too noisy. Statistical process control (monitoring the distribution of steps and cost per task) catches these drifts before they compound into outright failures or massive billing spikes.
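The scenario above can be sketched with a one-sided CUSUM on step counts. This is an illustration under stated assumptions: the run records are made-up data, and `target`, `slack`, and `threshold` are hypothetical tuning values that would normally be derived from a known-good baseline for the task type.

```python
def cusum_upper(values, target, slack, threshold):
    """One-sided (upper) CUSUM: index of the first alarm, or None if no alarm."""
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate only the excess over target + slack; never go below zero.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None


# Hypothetical per-task records: (succeeded, step_count).
runs = [(True, n) for n in [4, 5, 3, 4, 4, 5, 3, 4, 4, 4]]   # before the drift
runs += [(True, n) for n in [6, 7, 6, 7, 6, 7]]              # after the drift

success_rate = sum(ok for ok, _ in runs) / len(runs)
alarm_index = cusum_upper(
    [n for _, n in runs], target=4.0, slack=0.5, threshold=4.0
)
print(f"success rate: {success_rate:.0%}, first CUSUM alarm at run index: {alarm_index}")
```

Every run in the sample succeeds, so a success-rate alert never fires, while the cumulative sum crosses the threshold a few runs after the step count shifts. The slack term absorbs ordinary run-to-run variation so that isolated 5-step tasks in the baseline do not accumulate into an alarm.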
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T20:31:55.297408+00:00— report_created — created