Report #52761
[architecture] Confidence scores from single models are poorly calibrated and cannot be trusted for escalation decisions
Calibrate confidence scores using temperature scaling or Platt scaling on a held-out validation set before using them for routing decisions; never use raw softmax probabilities as confidence
Journey Context:
Teams often use the softmax probability of the output token as a 'confidence score' to decide whether to escalate to a human or retry. This is dangerous: LLMs are poorly calibrated (often overconfident on wrong answers, and sometimes underconfident on correct ones), so a raw softmax probability does not map reliably onto the actual probability that the answer is correct. Solution: treat confidence calibration as a supervised learning problem. On a held-out validation set, collect model outputs, their raw confidence scores, and whether each output was actually correct, then fit a calibration model. Two common choices are Platt scaling (a logistic regression from raw confidence to observed correctness) and temperature scaling (a single parameter T that divides the logits before the softmax; T > 1 softens the distribution). Apply the fitted calibration to raw scores before thresholding for routing. After calibration, a score of 0.9 should correspond to roughly a 90% chance that the answer is correct, at least on inputs that resemble the validation set.
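The Platt scaling variant described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not a reference implementation: the function names (fit_platt, calibrate, should_escalate), the plain gradient-descent fit, and the 0.8 routing threshold are all assumptions made for the example. In practice a library logistic regression would usually replace the hand-written fit.

```python
import math

def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def _logit(p, eps=1e-6):
    # Map a probability to log-odds, clipping to avoid infinities at 0 and 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fit_platt(raw_conf, correct, lr=0.1, steps=5000):
    """Fit (a, b) so that sigmoid(a * logit(raw) + b) approximates P(correct).

    raw_conf: raw confidence scores in (0, 1) from the held-out validation set.
    correct:  1 if the corresponding output was actually correct, else 0.
    Uses batch gradient descent on the mean log loss (hypothetical settings).
    """
    xs = [_logit(p) for p in raw_conf]
    a, b = 1.0, 0.0  # start from the identity mapping (no recalibration)
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, correct):
            err = _sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(raw, a, b):
    # Convert one raw confidence score into a calibrated probability.
    return _sigmoid(a * _logit(raw) + b)

def should_escalate(raw, a, b, threshold=0.8):
    # Route on the calibrated score, never on the raw one.
    # The threshold value is an assumption for illustration.
    return calibrate(raw, a, b) < threshold
```

For example, if the validation set shows that outputs with a raw score of 0.95 are correct only about 60% of the time, the fitted mapping returns roughly 0.6 for a raw 0.95, and should_escalate sends that output to a human even though the raw score looks safe. The calibration is only as good as the validation set, so it should be refit when the model, prompt, or input distribution changes.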
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:03:27.351122+00:00— report_created — created