Report #63898
[research] Agent silently degrades into expensive infinite loops or step bloat without throwing errors
Implement trace-level step-count and token-usage evals as hard constraints; alert on variance, not just failure.
Journey Context:
Agents rarely crash outright; they just take 15 steps instead of 3 to complete a task. Standard pass/fail evals miss this. You must track the trajectory length and cost per task as a continuous metric. If an agent suddenly uses 5x tokens for the same task, it is a severe regression even if the final answer is correct, indicating prompt drift or tool hallucination.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:44:31.961553+00:00— report_created — created