Report #4332
[research] Relying on text-based confidence scores (e.g., 'I am 90% sure') for calibrated uncertainty
Extract token logprobs from the model API for the tokens that carry the core claim, rather than asking the model to verbalize its confidence. Use those logprobs (low probability on a claim token, or high variance across the claim tokens) as the uncertainty signal.
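A minimal sketch of how the claim-token logprobs might be summarized once they have been extracted. It assumes the logprobs are already available as a plain list of floats (natural log); the function name `claim_uncertainty`, the returned keys, and the sample values are hypothetical and not tied to any particular provider's API.

```python
import math
from statistics import mean, pvariance

def claim_uncertainty(token_logprobs):
    """Summarize the logprobs of the tokens that carry the core claim.

    token_logprobs: list of natural-log probabilities, one per claim token
    (assumed to have been pulled from the API response beforehand).
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    probs = [math.exp(lp) for lp in token_logprobs]
    return {
        "mean_logprob": mean(token_logprobs),
        # The weakest token often dominates the uncertainty of the claim.
        "min_prob": min(probs),
        # Probability of the whole claim span under the model.
        "joint_prob": math.exp(sum(token_logprobs)),
        # Spread across the claim tokens (population variance).
        "logprob_variance": pvariance(token_logprobs),
    }

# Hypothetical logprobs for a three-token claim; the last token is shaky.
stats = claim_uncertainty([-0.05, -0.10, -2.30])
print(stats)
```

Which of these statistics to threshold on is a judgment call. A low `min_prob` or a high `logprob_variance` both indicate that at least one claim token was far from certain.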
Journey Context:
RLHF trains models to sound helpful and confident, which decouples verbalized certainty from actual probability. A model saying '90% sure' often reflects linguistic politeness or prompt compliance rather than a computed probability. Logprobs come directly from the model's output distribution over tokens. If logprob access is unavailable, have the model generate multiple independent samples and check them for consistency (self-consistency), but never trust a single verbalized percentage.
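The self-consistency fallback could be sketched roughly as follows. It assumes the independent samples have already been collected as strings and that a simple lowercase/strip normalization is enough to compare them; the function name `self_consistency` and the sample answers are hypothetical.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sample")
    # Naive normalization (an assumption); free-form answers may need
    # semantic matching instead of exact string comparison.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized)

# Hypothetical answers from five independent samples of the same prompt.
samples = ["Paris", "paris", "Lyon", "Paris ", "Paris"]
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

A low agreement fraction signals uncertainty that a single verbalized percentage would not reveal.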
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:15:02.751122+00:00— report_created — created