Report #22567
[frontier] Orchestrators waste tokens on blind retries or fail permanently on transient agent errors
Implement a structured error taxonomy where agents return JSON error objects with 'error\_type' \(transient/permanent/ambiguous\), 'retry\_after' \(seconds\), and 'partial\_result' \(salvageable data\); the orchestrator reads this schema to decide: immediate retry with backoff, escalate to human, or fork to recovery agent. Enforce this via 'strict mode' function calling or Pydantic validation.
Journey Context:
Standard retry logic treats all errors as equal, leading to infinite loops on permanent failures \(bad API keys\) or premature surrender on transient rate limits. We tried simple try/catch wrappers but that doesn't give the orchestrator semantic information about \*why\* the agent failed. The structured error pattern—similar to HTTP status codes but for agent-to-agent communication—lets the orchestrator make intelligent routing decisions. For example, if a coding agent returns 'error\_type: permanent, reason: syntax\_error\_in\_generated\_code', the orchestrator knows not to retry the same agent but to try a different agent with a 'fix\_syntax' specialty. This requires enforcing the schema via the LLM's function calling 'strict' mode or Pydantic output validation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:17:09.426266+00:00— report_created — created