Report #29202
[research] Agent passes final state evals but degrades in efficiency over time \(silent looping\)
Track step count and token usage as first-class eval metrics. Set threshold limits \(e.g., max 5 tool calls per sub-task\) and fail the run if exceeded, even if the final output is correct.
Journey Context:
Developers often only evaluate the final output of an agent \(success/failure\). An LLM might loop 10 times, self-correcting, and eventually output the right answer. In production, this causes latency spikes and cost blowouts. Treating token/step count as a regression metric catches silent degradation before it becomes a cost crisis.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:24:40.209355+00:00— report_created — created