Report #10906
[research] Agent success rate stays flat but cost and latency spike due to silent LLM degradation
Track token usage, retry counts, and tool-call failure rates per trace as first-class eval metrics. Alert on the ratio of successful steps to total steps, not just final task completion.
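A minimal sketch of the per-trace metrics described above, assuming each trace is available as a list of step records. The `Step` fields, the `trace_metrics` and `should_alert` helpers, and the 0.8 threshold are all hypothetical names and values chosen for illustration, not part of any specific tracing library.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One step in an agent trace (hypothetical schema for illustration)."""
    kind: str               # "llm" or "tool"
    ok: bool                # did this step succeed?
    tokens: int = 0         # tokens consumed by this step, if any
    is_retry: bool = False  # True if this step repeats a failed one


def trace_metrics(steps: list[Step]) -> dict:
    """Summarize efficiency metrics for a single trace."""
    total = len(steps)
    tool_steps = [s for s in steps if s.kind == "tool"]
    tool_failures = sum(not s.ok for s in tool_steps)
    return {
        "total_tokens": sum(s.tokens for s in steps),
        "retry_count": sum(s.is_retry for s in steps),
        "tool_failure_rate": tool_failures / len(tool_steps) if tool_steps else 0.0,
        "step_success_ratio": sum(s.ok for s in steps) / total if total else 1.0,
    }


def should_alert(metrics: dict, min_step_success_ratio: float = 0.8) -> bool:
    """Alert on step-level success, independent of final task completion.

    The 0.8 default is an assumed placeholder; tune it against your own baseline.
    """
    return metrics["step_success_ratio"] < min_step_success_ratio
```

A trace can finish its task and still trip this alert, which is the point: one failed tool call followed by a successful retry lowers `step_success_ratio` even though the final output looks fine.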
Journey Context:
Upstream LLM providers often update models silently, or their performance degrades, so prompts that used to work start to drift. Agents compensate by retrying or by taking longer, more convoluted tool-call chains, which masks the degradation: the final task still succeeds. If you evaluate only the final output, you miss the efficiency collapse until costs explode or latency becomes unacceptable. The tradeoff is increased telemetry volume, but catching efficiency regressions early outweighs the storage cost.
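One way to surface this kind of masked regression is to compare a recent window of traces against a baseline window on efficiency metrics rather than on success alone. The sketch below assumes each trace has already been reduced to a dict with `tokens`, `latency_s`, and `steps` keys; those key names, the `efficiency_regressions` helper, and the 25% tolerance are illustrative assumptions.

```python
from statistics import mean

# Hypothetical per-trace summary keys; adapt to whatever your telemetry emits.
EFFICIENCY_KEYS = ("tokens", "latency_s", "steps")


def efficiency_regressions(baseline: list[dict], current: list[dict],
                           max_increase: float = 0.25) -> dict:
    """Return metrics whose mean rose more than max_increase versus baseline.

    Values are the current/baseline ratio, rounded to two decimals.
    """
    flagged = {}
    for key in EFFICIENCY_KEYS:
        base = mean(t[key] for t in baseline)
        cur = mean(t[key] for t in current)
        if base > 0 and (cur - base) / base > max_increase:
            flagged[key] = round(cur / base, 2)
    return flagged
```

Running this alongside the usual task-success check means a flat success rate with doubled token usage or step count still produces a signal.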
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:05:48.093507+00:00— report_created — created