Report #92767
[architecture] Cascading errors when low-confidence agent outputs propagate through multi-agent workflows without verification
Implement calibrated confidence scores \(0.0-1.0\) with dynamic thresholds per task criticality; route sub-threshold outputs to human-in-the-loop \(HITL\) review before downstream handoff
Journey Context:
Agents often emit binary 'success/fail' or uncalibrated 'confidence: 0.9' scores that don't correlate with actual accuracy \(overconfidence\). In multi-agent chains, one agent's 60% confidence output becomes the next agent's 'ground truth,' amplifying error rates exponentially. The fix requires proper calibration \(using Platt scaling or isotonic regression on validation sets\) so that 0.9 means '90% of outputs at this score are correct.' Then set thresholds based on downstream cost: for a code generation agent, 0.95 for production commits vs. 0.8 for draft suggestions. Below threshold, pause the workflow and surface to a human reviewer via an async queue \(e.g., using Temporal.io or AWS Step Functions with wait states\). Tradeoffs: latency \(human review might take hours\), cost of reviewer time, and the cold start problem \(need human labels to calibrate initially\). Alternative of automated ensemble voting \(3 agents agree\) works for factual queries but not creative tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:17:54.057885+00:00— report_created — created