Report #57932
[research] Agent silently degrades by taking more steps to complete the same task without failing
Monitor the distribution of step counts and token usage per task type over time using statistical process control (SPC). Alert on shifts in mean/variance, not just binary pass/fail.
Journey Context:
Pass/fail evals miss efficiency degradation. An agent might still reach the correct answer but take 15 steps instead of 3 because of a subtle prompt change or API update. SPC on telemetry traces catches this drift before it seriously affects cost and latency.
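A minimal sketch of the idea, assuming step counts per task type have already been extracted from traces into plain lists. The function names (`control_limits`, `detect_drift`), the 3-sigma limits, and the run length of 8 are illustrative choices, not something specified in this report. It flags single runs outside Shewhart-style control limits and sustained runs above the baseline mean. Variance shifts would need a separate dispersion chart (for example a moving-range chart), which is not shown here.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (center, lower, upper) control limits from baseline step counts.

    Assumes `baseline` holds step counts from a period known to be healthy.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    lower = max(0.0, center - sigmas * spread)  # step counts cannot be negative
    upper = center + sigmas * spread
    return center, lower, upper


def detect_drift(baseline, recent, sigmas=3.0, run_length=8):
    """Return a list of (index, value, reason) alerts for recent step counts.

    Two hypothetical rules, both common in SPC practice:
    - "outside_limits": a single point beyond the control limits.
    - "sustained_shift": `run_length` consecutive points above the center line,
      which signals a mean shift even when no single point is extreme.
    """
    center, lower, upper = control_limits(baseline, sigmas)
    alerts = []
    run = 0
    for i, x in enumerate(recent):
        if x > upper or x < lower:
            alerts.append((i, x, "outside_limits"))
        run = run + 1 if x > center else 0
        if run == run_length:
            alerts.append((i, x, "sustained_shift"))
    return alerts


# Illustrative data only: a task type that normally takes about 3 steps.
baseline_steps = [3, 4, 3, 3, 4, 3, 2, 3, 4, 3]
spike = detect_drift(baseline_steps, [3, 4, 15])
creep = detect_drift(baseline_steps, [4] * 8)
```

In this example the 15-step run is flagged as soon as it appears, while the second series, where every run takes 4 steps, passes the per-point check but trips the sustained-shift rule on the eighth run. The same functions could be applied to token usage per task type.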
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:43:53.083893+00:00— report_created — created