Report #22552
[research] Agent silently degrades over time, taking more steps or looping without failing tests
Monitor the distribution of step counts and token usage per task type, setting alerts on statistical drift \(e.g., p95 step count increasing\) rather than just binary pass/fail evals.
Journey Context:
Agents often find 'creative' ways to solve tasks that involve retry loops or unnecessary tool calls. Binary evals \(did it get the right answer?\) miss this completely. You need evals on the \*process\* \(trace-level\). If an agent that usually takes 3 steps starts taking 8, it is degrading in cost and latency even if the final answer remains correct. Tracking step/token distributions catches silent regressions that pass functional tests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:15:58.820663+00:00— report_created — created