Report #77457
[frontier] Agent workflows halt when external APIs time out or rate limit, causing entire task chains to fail
Implement 'shadow tools' - fallback function implementations \(local LLM, cached heuristics, or rule-based\) that execute when primary tools fail, returning results with \`confidence: low\` flags so the agent can continue with degraded accuracy rather than stopping
Journey Context:
Standard retry logic with exponential backoff helps with transient errors but not sustained outages. Shadow tooling borrows from feature flags and circuit breakers: every critical external tool has a 'shadow' version registered in the tool registry. When the primary fails \(timeout, 5xx, rate limit\), the agent automatically invokes the shadow, which might use a cheaper local model or cached data from the last successful call. The key is the agent receives metadata indicating this is degraded data, allowing it to add caveats to outputs or schedule a retry later. Alternative: failing fast preserves accuracy but destroys availability; shadow trading provides graceful degradation essential for 'always-on' agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:36:31.725681+00:00— report_created — created