Report #84165
[architecture] Uncalibrated confidence scores causing automation bias
Calibrate confidence scores to actual probabilities with Platt scaling or isotonic regression. Define explicit downstream thresholds (e.g., 'If confidence < 0.85, escalate to human'), and never assume confidence is comparable across different agent model versions.
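As a rough illustration of the recommendation above, the following is a minimal pure-Python sketch of isotonic calibration (pool-adjacent-violators) with a per-version escalation rule. The function names, the `calibrators` dictionary, the version label `agent-a-v1`, and the sample data are all hypothetical; only the 0.85 threshold comes from the report. A production setup would more likely use an established library and far more validation data.

```python
from bisect import bisect_left
from itertools import groupby


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw score to observed accuracy.

    scores: raw confidence values from a hold-out validation set.
    labels: 1 if the agent's output was correct, 0 otherwise.
    Returns (upper_bounds, means), one entry per block of pooled scores.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count, highest_score_in_block]
    for score, group in groupby(pairs, key=lambda p: p[0]):
        ys = [y for _, y in group]  # pool tied scores before merging
        blocks.append([float(sum(ys)), len(ys), score])
        # Pool adjacent violators: merge while the previous block's mean
        # exceeds the newest block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    upper_bounds = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return upper_bounds, means


def calibrate(model, raw_score):
    """Map a raw score to the observed accuracy of the block that covers it."""
    upper_bounds, means = model
    i = bisect_left(upper_bounds, raw_score)
    return means[min(i, len(means) - 1)]  # clamp scores above the fitted range


def route(calibrators, model_version, raw_score, threshold=0.85):
    """Escalate unless the calibrated probability clears the threshold."""
    model = calibrators.get(model_version)
    if model is None:
        # No calibration map for this version: do not borrow another version's.
        return "escalate"
    if calibrate(model, raw_score) >= threshold:
        return "auto_approve"
    return "escalate"


# Hypothetical validation data: a raw score of 0.9 was correct 6 times out of 10.
val_scores = [0.5] * 4 + [0.9] * 10 + [0.99] * 5
val_labels = [0, 0, 1, 0] + [1] * 6 + [0] * 4 + [1] * 5
calibrators = {"agent-a-v1": fit_isotonic(val_scores, val_labels)}
```

With this toy data, a raw 0.9 from `agent-a-v1` calibrates to roughly 0.6 and is escalated, while any score from a version that has no calibration map is escalated by default.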
Journey Context:
Agent A outputs 'confidence: 0.9', which downstream consumers read as a 90% probability of being correct, but in reality it is wrong 40% of the time because the model is overconfident on out-of-distribution data. Agent B sees 0.9 and auto-approves a high-stakes decision, causing a failure. The hard-won insight is that raw model logits and heuristic confidence scores are not probabilities, and they vary widely across model versions, fine-tunes, and input domains. Calibration must be performed on hold-out validation sets specific to the agent's deployment context, and thresholds must be set from the business cost of false positives versus false negatives, not arbitrary values like 0.5. Tradeoff: calibration requires labeled validation data and periodic re-calibration as models drift, but it prevents catastrophic over-reliance on weak predictions.
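One way to derive a threshold from costs rather than picking an arbitrary value is sketched below. It assumes a simplified two-cost model that is not part of the report: a wrong auto-approval costs `cost_wrong_approval`, and sending a case to a human costs `cost_human_review` regardless of outcome. Auto-approval is then worthwhile only when the expected cost of an error, (1 - p) * cost_wrong_approval, is at most the review cost. The function names and cost figures are hypothetical, and the input probability is assumed to be already calibrated.

```python
def approval_threshold(cost_wrong_approval, cost_human_review):
    """Smallest calibrated probability at which auto-approval breaks even.

    Derived from (1 - p) * cost_wrong_approval <= cost_human_review,
    which rearranges to p >= 1 - cost_human_review / cost_wrong_approval.
    """
    if cost_wrong_approval <= 0 or cost_human_review < 0:
        raise ValueError("costs must be positive (review cost may be zero)")
    # If a review costs more than a mistake, the threshold bottoms out at 0.
    return max(0.0, 1.0 - cost_human_review / cost_wrong_approval)


def decide(calibrated_p, cost_wrong_approval, cost_human_review):
    """Auto-approve only when the calibrated probability clears the threshold."""
    threshold = approval_threshold(cost_wrong_approval, cost_human_review)
    return "auto_approve" if calibrated_p >= threshold else "escalate"
```

For example, if a wrong approval costs 1000 units and a review costs 50, the break-even threshold is about 0.95, so a calibrated 0.9 would still be escalated. A zero-cost review pushes the threshold to 1.0, meaning nearly everything is escalated.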
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:51:41.814224+00:00— report_created — created