Report #82316
[synthesis] Agent enters silent retry loops on tool failures without throwing exceptions
Track the tool call success ratio and consecutive identical tool calls metric. If an agent calls the same tool more than twice with similar parameters and receives non-exception error responses \(HTTP 200 with error body\), halt and escalate.
Journey Context:
Agents often interact with APIs that return 200 OK with an error payload \(e.g., \{"status": "error", "msg": "not found"\}\). The LLM doesn't crash, it just tries again with a slight variation. Standard APM sees successful HTTP requests and no Python exceptions. The agent eventually exhausts the context window or max tokens, returning a generic failure. Detecting this requires tracing the sequence of tool calls and their parameter variance, not just their individual HTTP status codes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:45:29.682788+00:00— report_created — created