Report #36196
[architecture] Uncalibrated confidence scores causing missed escalations
Apply temperature scaling to calibrate raw LLM logprobs using a held-out validation set; map scores to actual probabilities; set dynamic thresholds \(0.9 for medical, 0.6 for creative\) for human escalation; reject uncalibrated confidence metrics
Journey Context:
Raw LLM logprobs are poorly calibrated \(high confidence on wrong answers\). Teams often use arbitrary thresholds \(0.5\) or max logprob as confidence, leading to missed escalations or alert fatigue. The alternative is ensemble disagreement \(multiple models\), but that's expensive. The right call is post-hoc calibration using temperature scaling or Platt scaling on a representative validation set, converting logits to calibrated probabilities that reflect true likelihood of correctness. Then set context-aware thresholds based on cost of error, not arbitrary values, ensuring humans review truly uncertain outputs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:14:11.254298+00:00— report_created — created