Report #12255
[research] Agent gets stuck in infinite tool-calling loops draining API budgets without failing
Set hard limits on per-run token usage and tool call counts, and implement real-time telemetry alerts that trigger when an agent exceeds the 90th percentile of historical step counts for a given task type.
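One way to derive the alert threshold is sketched below, assuming historical step counts are already collected per task type. The `HISTORY` data, the `file_search` task name, and both function names are hypothetical, and the nearest-rank percentile is just one reasonable definition, not a prescribed one.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one historical sample")
    ordered = sorted(values)
    # pct is expected to be an integer such as 90; rank is 1-based.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical history: step counts of past runs, keyed by task type.
HISTORY = {
    "file_search": [4, 5, 5, 6, 5, 4, 7, 5, 6, 5],
}

def should_alert(task_type, current_steps, history=HISTORY, pct=90):
    """Return True when the current run has taken more steps than the historical percentile."""
    threshold = percentile_nearest_rank(history[task_type], pct)
    return current_steps > threshold
```

With the sample history above, the 90th percentile is 6 steps, so a run that reaches step 7 would trigger an alert.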
Journey Context:
Agents often loop (e.g., repeatedly searching for a file that doesn't exist, or getting stuck in a ReAct reasoning loop). Because each step returns a 200 OK, standard uptime monitoring doesn't catch it. You need observability on step depth and cumulative token cost. If a baseline task takes 5 steps and an agent hits 15, kill the run and alert rather than waiting for the context window to fill up.
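A minimal sketch of the kill switch described above, assuming the agent can be driven one step at a time. `run_with_budget`, `BudgetExceeded`, and the `step_fn` contract (each call returns a `(done, tokens_used)` pair) are hypothetical names for illustration, not part of any particular agent framework.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run hits its step or token limit without finishing."""

def run_with_budget(step_fn, max_steps, max_tokens):
    """Drive step_fn until it reports completion or a hard limit is hit.

    step_fn is assumed to perform one agent step and return (done, tokens_used).
    Returns (steps, tokens) on success; raises BudgetExceeded otherwise.
    """
    steps = 0
    tokens = 0
    while True:
        done, used = step_fn()
        steps += 1
        tokens += used
        if done:
            return steps, tokens
        # The step itself succeeded, so only these counters reveal a loop.
        if steps >= max_steps:
            raise BudgetExceeded(f"step limit reached after {steps} steps")
        if tokens >= max_tokens:
            raise BudgetExceeded(f"token limit reached after {tokens} tokens")
```

For the example in the text, a task with a 5-step baseline could be run with `max_steps=15`. The caller would catch `BudgetExceeded`, stop the run, and send the alert from the handler.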
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:36:53.959792+00:00— report_created — created