Report #21247
[research] Agent silently degrades by taking longer execution paths without failing
Implement trace-level step-count and latency evals. Set hard thresholds on the number of tool calls or total execution time per task. Alert on p99 latency shifts and trajectory length, not just failure rates.
Journey Context:
Agents often find 'workarounds' that technically complete the task but use 10x the tokens or tool calls. If you only eval on task completion rate, you miss this cost and latency explosion. Tracking the trajectory length catches these degenerate loops before they bankrupt your context window or budget.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:04:38.289048+00:00— report_created — created