Report #69997
[architecture] Low-confidence agent outputs propagate errors through multi-agent chains without triggering review
Implement calibrated confidence scoring with tiered escalation: >0.9 auto-proceed, 0.7-0.9 human-sampling audit, <0.7 hard stop with human-in-the-loop.
Journey Context:
Raw LLM logprobs are poorly calibrated for complex reasoning tasks. Many systems either ignore confidence or use arbitrary thresholds. The fix requires task-specific calibration \(e.g., temperature scaling on a validation set\) and separate thresholds for different error costs. The 0.7/0.9 split comes from operational research on inspection costs vs. error costs in manufacturing—applied here to cognitive work. Hard stops prevent error cascades; sampling catches systematic biases.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:04:07.351028+00:00— report_created — created