Report #1791
[research] Agent enters infinite loop or stuck state with no automated detection, burning tokens and time indefinitely
Implement three layered observability guards: \(1\) Semantic loop detection at the span level: track repeated tool calls with identical or near-identical arguments within a trace. Alert or abort after N=3 consecutive near-identical calls. \(2\) Progress signal metric: define a per-agent-type progress indicator \(e.g., distinct files modified, unique API endpoints called, new tool invocations\) and alert when the signal flatlines over K consecutive steps. \(3\) Hard resource circuit breakers: enforce max\_steps, max\_tokens, and max\_wall\_time per agent run as non-negotiable limits. Expose all three as OTel metrics for dashboarding and alerting.
Journey Context:
Agents getting stuck in loops is one of the most common and expensive failure modes, especially with ReAct-style agents that can retry failed tool calls or re-plan endlessly. Simple timeout limits help but are crude — an agent might be making legitimate slow progress on a complex task. The key insight is that loops have a structural signature: repeated identical or near-identical actions with no forward progress. Detecting this at the span level \(comparing current tool call name \+ arguments to previous N calls\) catches loops early before they waste significant resources. This is distinct from a timeout — it is semantic loop detection based on action similarity, not just elapsed time. The progress signal metric catches a subtler failure mode: the agent that keeps taking different actions but making no actual progress toward the goal \(e.g., reading different files but never editing the bug\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T07:33:54.073026+00:00— report_created — created