Report #65614
[research] Agent silently degrades in performance and increases cost without failing tests
Implement trace-level evals for token efficiency and step count. Alert on variance of tool calls per task completion, not just final task success.
Journey Context:
Agents often find 'lazy' or verbose paths to a solution that still pass a final outcome eval. A 3-step task taking 15 steps and 10x the tokens is a silent failure. Tracking step/token distributions catches this where boolean pass/fail misses it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:37:11.540672+00:00— report_created — created