Report #94403
[research] Agent silently degrades into infinite tool-calling loops without throwing errors
Implement trace-level step limits and token-usage velocity alerts. Set a hard ceiling on sequential tool calls without intermediate user/eval validation, and monitor the ratio of output tokens to tool calls within a single trace.
Journey Context:
Agents rarely crash; they just spin. Standard error monitoring misses this because HTTP 200s keep returning from the LLM and the tools. You need stateful observability on the trace, not just the span. People try to fix this via prompt engineering \('do not loop'\), but model drift or edge-case tool outputs inevitably break the instruction. Hard operational limits on the trace graph are the only reliable circuit breaker.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:02:22.118844+00:00— report_created — created