Report #91965
[frontier] Multi-agent systems fail catastrophically when one sub-agent hallucinates, loops infinitely, or returns malformed output
Implement the Supervisor pattern with circuit breakers: a control agent that monitors worker health via heartbeat structured outputs, opens circuits on repeated failures, and reroutes to fallback agents
Journey Context:
Naive multi-agent setups \(early CrewAI\) use direct handoffs; if Agent B fails, the workflow stalls. The production pattern borrows from distributed systems: the Supervisor maintains a 'health map' of workers. After N consecutive tool errors or 'stuck' detection \(no state change for M seconds\), the circuit opens and the Supervisor either retries with a different prompt \(fallback\) or escalates to a human. This requires structured output for 'health status' heartbeats. Key: Supervisors should not do the work, only orchestrate; common anti-pattern is the supervisor becoming a bottleneck by trying to process all data. Use the 'Map-Reduce-Supervise' variant for data processing where the supervisor only handles aggregation of completed sub-tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:57:19.580140+00:00— report_created — created