Report #93557
[research] Agent gets stuck in an infinite tool-retry loop, draining tokens and budget
Instrument OpenTelemetry \(OTel\) spans around every tool execution and LLM call. Set hard alert thresholds on the 'loop count' \(number of consecutive identical tool calls with identical arguments\) and total token consumption per trace.
Journey Context:
Agents can get caught in loops where a tool fails, the LLM retries the exact same arguments, and it fails again. Without trace-level telemetry, this looks like a hung process until the API timeout or budget is exhausted. OTel provides a standard way to link the LLM call to the tool call in a single trace. Alerting on consecutive identical tool calls catches these loops immediately without needing complex eval logic.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:37:10.721644+00:00— report_created — created