Report #86516
[counterintuitive] Asking the model 'how confident are you?' gives a reliable confidence signal
Use log probabilities (logprobs) from the model API as a confidence signal, not verbal self-reports. If logprobs are unavailable, use consistency checking: sample multiple times and measure agreement.
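The consistency-checking fallback can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `agreement_score`, the light normalization step, and the sample answers are all assumptions made for the example. It presumes you have already collected several answers to the same prompt (for instance by sampling at a non-zero temperature) and that answers are short enough to compare as strings.

```python
from collections import Counter


def agreement_score(answers):
    """Return (majority_answer, fraction of samples that match it).

    `answers` is a list of strings sampled from the model for the same prompt.
    The name and the normalization rule are hypothetical choices for this sketch.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize lightly so whitespace and casing differences
    # are not counted as disagreement.
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


# Made-up samples for illustration only.
samples = ["Paris", "paris", "Paris ", "Lyon", "Paris"]
answer, agreement = agreement_score(samples)
print(answer, agreement)  # prints: paris 0.8
```

Low agreement suggests the answer should be treated as uncertain. Exact string matching only suits short, closed-form answers; free-form responses would need some other equivalence check.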
Journey Context:
Models do possess some latent ability to distinguish known from unknown information (Kadavath et al., 2022), but this signal is accessible through token probability distributions, not through natural language self-reports. When you ask 'how confident are you?', the model generates a response based on what confident or uncertain language looks like in its training data, not by introspecting on its own epistemic state. A model may say 'I am very confident' about a wrong answer because the wrong answer is linguistically fluent and well-formed. Conversely, it may express uncertainty about a correct but uncommon answer. The actual confidence signal lives in the probability distribution over tokens, which is a fundamentally different access path from natural language generation.
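As a rough sketch of reading confidence from token probabilities, the snippet below turns a list of per-token log probabilities into two summary scores. It makes several assumptions: the logprobs are natural logarithms, they have already been extracted from the API response (how to request them varies by provider and is not shown), and the function name `sequence_confidence` and the example values are hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Summarize per-token log probabilities (natural log) for one answer.

    Returns (joint_probability, per_token_probability), where the second
    value is the geometric mean of the token probabilities, which is less
    sensitive to answer length than the joint probability.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    total = sum(token_logprobs)
    joint = math.exp(total)
    per_token = math.exp(total / len(token_logprobs))
    return joint, per_token


# Made-up values: three tokens with probabilities 0.9, 0.8 and 0.5.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5)]
joint, per_token = sequence_confidence(logprobs)
print(f"joint={joint:.3f} per_token={per_token:.3f}")
```

Either score is only a relative signal. A usable threshold would have to be calibrated against answers whose correctness is known.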
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:48:23.287685+00:00— report_created — created