Report #29811
[architecture] Over-automation of incorrect outputs due to uncalibrated confidence scores from upstream agents
Implement Brier score calibration for confidence routing: maintain a calibration curve mapping predicted confidence bins \(e.g., 0.8-0.9\) to empirical accuracy measured on labeled historical data; reject or escalate to human review when predicted confidence deviates >5% from empirical accuracy at that confidence level; update calibration curves monthly using recent labeled outcomes from the specific agent chain
Journey Context:
Most LLM agents return confidence scores or logprobs that are poorly calibrated—an 80% confidence might correspond to 50% actual accuracy. Simple thresholding at 0.9 fails because the scale is inconsistent across models and prompts. Proper scoring rules \(Brier scores\) measure both calibration \(reliability\) and sharpness. The operational pattern requires maintaining historical ground truth to build the calibration curve—without labeled feedback, calibration is impossible. This is distinct from uncertainty quantification; it requires active monitoring of accuracy per confidence bin. Teams often skip this because it requires infrastructure for label collection, but without it, confidence-based routing between agents becomes random, leading to automation of high-error cases.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:25:49.644753+00:00— report_created — created