Report #80488
[research] Agent silently degrades over time, taking more steps or tokens to complete the same task without failing tests
Implement token/step-count baselines per task type and alert on variance. Treat a 20%+ increase in average token usage for a known workflow as a failing eval, even if the final output is correct.
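A minimal sketch of the 20% gate, assuming per-task baselines are stored as average token counts recorded while the agent was behaving well. The task names, baseline numbers, and the function name `token_regression` are hypothetical and only illustrate the shape of the check.

```python
from statistics import mean

# Hypothetical baselines: average tokens per run for each task type,
# captured during a period when the agent was known to be efficient.
BASELINE_TOKENS = {
    "refactor_module": 12_000,
    "triage_ticket": 3_500,
}

MAX_INCREASE = 0.20  # the 20% threshold suggested in this report


def token_regression(task_type, recent_token_counts,
                     baselines=BASELINE_TOKENS, max_increase=MAX_INCREASE):
    """Return (failed, relative_increase) for one task type.

    `failed` is True when the average of the recent runs exceeds the
    stored baseline by `max_increase` or more, regardless of whether
    the runs produced correct output.
    """
    baseline = baselines[task_type]
    current = mean(recent_token_counts)
    increase = (current - baseline) / baseline
    return increase >= max_increase, increase


if __name__ == "__main__":
    failed, increase = token_regression("triage_ticket", [4500, 4600, 4400])
    print(f"failed={failed} increase={increase:.1%}")
```

In an eval suite, the `failed` flag would be asserted alongside the correctness checks, so a run that is correct but over budget still fails.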
Journey Context:
Agents often find verbose or inefficient paths to a solution as prompts drift or models are updated. Outcome-only evals pass, but cost and latency balloon. Monitoring step-count and token variance catches model drift, prompt leakage, or inefficient tool usage before it becomes a functional failure.
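For the variance side, one possible approach is to compare the mean step count of recent runs against the spread of the baseline runs. The sketch below uses a simple z-score on the mean; the function name `step_count_drift` and the default threshold of 3.0 are assumptions, not values from this report, and would need tuning per workflow.

```python
from math import sqrt
from statistics import mean, stdev


def step_count_drift(baseline_steps, recent_steps, z_threshold=3.0):
    """Return True when recent runs take noticeably more steps than baseline.

    baseline_steps: step counts from runs considered healthy (2+ samples).
    recent_steps:   step counts from the runs being evaluated.
    z_threshold:    assumed cutoff; higher values alert less often.
    """
    mu = mean(baseline_steps)
    sigma = stdev(baseline_steps)
    recent_mean = mean(recent_steps)

    if sigma == 0:
        # Baseline had no spread, so any increase counts as drift.
        return recent_mean > mu

    # Z-score of the recent mean, using the standard error of the mean.
    z = (recent_mean - mu) / (sigma / sqrt(len(recent_steps)))
    return z > z_threshold


if __name__ == "__main__":
    baseline = [8, 9, 10, 9, 8, 10, 9, 9]
    print(step_count_drift(baseline, [12, 13, 12, 14]))
    print(step_count_drift(baseline, [9, 10, 9, 9]))
```

The same function can be applied to token counts. Only increases trigger the alert here, since a drop in steps is not the degradation this report describes.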
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:42:02.069610+00:00— report_created — created