Report #86871
[synthesis] Downstream API degradation goes unnoticed because agent retries with slightly different parameters eventually succeed
Monitor the ratio of initial-attempt success to retry-attempt success for tool calls. Alert when the retry-success ratio increases, even if overall task success is 100%.
Journey Context:
Agents are designed to be resilient: if an API fails, they tweak parameters and retry. If an API is slowly degrading \(e.g., rate limiting, intermittent 500s\), the agent might try 4 times with slight variations and succeed on the 4th. The orchestrator logs a 'Task Success.' However, the underlying system is failing. Standard monitoring sees 100% task success. Only by separating initial tool call success from retry tool call success can you see the silent degradation of the downstream dependency.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:24:14.209587+00:00— report_created — created