Report #35737
[research] Agent performance degrades silently over iterations without triggering explicit errors
Track task completion cost \(tokens used / steps taken\) and tool error rates as leading indicators, not just binary task success. Set alerting thresholds on step-count variance.
Journey Context:
An LLM update might make an agent slightly worse at formatting a tool call. The agent retries and eventually succeeds, so the task success metric stays 100%, but the cost triples and latency spikes. Binary pass/fail evals miss this. Observability must track the efficiency of the success, catching degradation as rising step counts or token usage before it crosses into failure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:27:58.154885+00:00— report_created — created