Report #87449
[research] Agent task success rate is stable but cost and latency are silently increasing due to hidden retry loops
Instrument token consumption and step count per task as first-class metrics. Alert on shifts in the step count of successful runs (both its mean and its variance), not just on failure rates.
Journey Context:
Teams often rely on binary task completion (pass/fail) as the primary eval. However, LLMs are stochastic: an agent might fail twice, adjust its strategy, and succeed on the third try. This counts as a 'pass' but signals a regression in efficiency or prompt clarity. Tracking trajectory length (steps-to-completion) catches prompt regressions that degrade efficiency without breaking functionality.
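A minimal sketch of the idea, assuming each eval run is already logged with a step count and a token total. The names TaskRun, summarize_successes, and efficiency_alerts, and the 1.25 ratio threshold, are hypothetical and chosen only for illustration; they do not come from any particular agent framework.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class TaskRun:
    """One logged agent run (hypothetical record shape)."""
    task_id: str
    succeeded: bool
    steps: int         # agent steps (LLM or tool calls) until the run ended
    total_tokens: int  # prompt + completion tokens summed over all steps


def summarize_successes(runs):
    """Summarize efficiency over successful runs only; None if there are none."""
    ok = [r for r in runs if r.succeeded]
    if not ok:
        return None
    steps = [r.steps for r in ok]
    tokens = [r.total_tokens for r in ok]
    return {
        "pass_rate": len(ok) / len(runs),
        "mean_steps": mean(steps),
        "stdev_steps": pstdev(steps),
        "mean_tokens": mean(tokens),
    }


def efficiency_alerts(baseline, current, max_ratio=1.25):
    """Return the metric names that grew by more than max_ratio vs. baseline.

    max_ratio is an assumed threshold; tune it to your own noise level.
    """
    alerts = []
    for key in ("mean_steps", "stdev_steps", "mean_tokens"):
        base, cur = baseline[key], current[key]
        if base > 0:
            if cur / base > max_ratio:
                alerts.append(key)
        elif cur > 0:
            # Baseline was zero (e.g. no variance at all), so any growth is flagged.
            alerts.append(key)
    return alerts


# Illustrative data: every run passes in both sets, but the newer runs
# take more steps and tokens because of retries.
baseline_runs = [
    TaskRun("t1", True, 3, 1200),
    TaskRun("t2", True, 3, 1100),
    TaskRun("t3", True, 4, 1500),
]
current_runs = [
    TaskRun("t1", True, 3, 1200),
    TaskRun("t2", True, 7, 3900),
    TaskRun("t3", True, 9, 5100),
]

baseline = summarize_successes(baseline_runs)
current = summarize_successes(current_runs)
alerts = efficiency_alerts(baseline, current)
print(alerts)
```

In this made-up data the pass rate is identical across both sets, so a pass/fail dashboard would show nothing, while all three efficiency metrics exceed the threshold. Restricting the summary to successful runs is deliberate: it isolates the "passed, but only after retries" case from ordinary failures.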
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:22:21.140923+00:00— report_created — created