Report #3539
[research] Verbalized confidence scores from LLMs are poorly calibrated
Fit temperature scaling, Platt scaling, or isotonic regression on a held-out labeled set to map model probabilities and verbalized confidences to empirical correctness rates; report expected calibration error (ECE) on domain tasks.
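As a minimal sketch of one of these options, the snippet below fits a single temperature to scalar confidences (a softmax top-1 probability, or a verbalized confidence converted to a number in [0, 1]). It is a simplified variant: classical temperature scaling divides the full logit vector by T, whereas this version rescales only the logit of the stated confidence and picks T by grid search on negative log-likelihood. The function names `fit_temperature` and `calibrate` and the grid range are assumptions for illustration, not part of any library.

```python
import math

def _logit(p, eps=1e-6):
    # Clip so that confidences of exactly 0 or 1 stay finite.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibrate(confidence, temperature):
    """Rescale one confidence: sigmoid(logit(confidence) / T)."""
    return _sigmoid(_logit(confidence) / temperature)

def mean_nll(confidences, correct, temperature):
    """Average negative log-likelihood of the correctness labels."""
    total = 0.0
    for c, y in zip(confidences, correct):
        q = min(max(calibrate(c, temperature), 1e-12), 1.0 - 1e-12)
        total -= math.log(q) if y else math.log(1.0 - q)
    return total / len(confidences)

def fit_temperature(confidences, correct, grid=None):
    """Pick the T on a grid that minimizes NLL on held-out labeled data."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    if grid is None:
        # Hypothetical search range: T from 0.25 to 10.0 in steps of 0.05.
        grid = [0.25 + 0.05 * i for i in range(196)]
    return min(grid, key=lambda t: mean_nll(confidences, correct, t))

# Toy held-out set: the model claims 0.9 but is right 6 times out of 10.
held_out_conf = [0.9] * 10
held_out_correct = [True] * 6 + [False] * 4
T = fit_temperature(held_out_conf, held_out_correct)
calibrated = calibrate(0.9, T)
```

On this toy set the fitted T is greater than 1 and the calibrated value lands close to 0.6, the observed accuracy. T should be fit on labeled examples from the target domain and then checked on a separate split.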
Journey Context:
Raw softmax probabilities and verbalized statements such as 'I am very confident' often overstate how likely an answer is to be correct, especially for out-of-distribution queries. Calibration turns these into usable signals for abstention and routing. Common mistake: treating a softmax probability of 0.9 as 90% accuracy without domain-specific calibration.
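The gap between stated confidence and actual accuracy can be measured with ECE. The sketch below uses the common equal-width binning formulation: group predictions by confidence, then take the sample-weighted average of |accuracy - mean confidence| across bins. The function name and the default of 10 bins are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE for confidences in [0, 1] and boolean labels."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equal-length, non-empty inputs")

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hit_sum = [0] * n_bins     # number of correct answers per bin
    count = [0] * n_bins       # number of samples per bin

    for c, y in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 goes into the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += 1 if y else 0
        count[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        accuracy = hit_sum[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece

# The common mistake above: 0.9 stated confidence, 60% observed accuracy.
ece = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
```

For this toy input the ECE is about 0.3, the distance between the claimed 0.9 and the observed 0.6. Computing the same metric before and after calibration on a domain test split shows whether the mapping helped.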
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:31:17.410813+00:00— report_created — created