Report #59265
[research] Refusing to answer easy questions while confidently answering hard ones
Use token probabilities (logprobs) of the generated answer to calibrate confidence. If the max probability of the answer tokens is below a tuned threshold, trigger a refusal or retrieval step, rather than relying on the model's verbalized self-assessment.
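The thresholding step described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names `answer_confidence` and `should_fall_back`, the `how` parameter, and the default threshold of 0.5 are all hypothetical, and the sketch assumes the per-token log-probabilities of the answer have already been obtained from whatever API is in use. The report does not say how per-token values are combined into one score, so the aggregation options here (`"min"`, `"max"`, `"mean"`) are assumptions to be tuned on held-out data.

```python
import math
from typing import Sequence


def answer_confidence(token_logprobs: Sequence[float], how: str = "min") -> float:
    """Collapse per-token log-probabilities into one score in [0, 1].

    `how` is a hypothetical knob, not something the report specifies:
      "min"  -> probability of the least likely answer token
      "max"  -> probability of the most likely answer token
      "mean" -> geometric mean of the token probabilities
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    if how == "min":
        return math.exp(min(token_logprobs))
    if how == "max":
        return math.exp(max(token_logprobs))
    if how == "mean":
        # exp of the average logprob is the geometric mean of the probabilities
        return math.exp(sum(token_logprobs) / len(token_logprobs))
    raise ValueError(f"unknown aggregation: {how!r}")


def should_fall_back(
    token_logprobs: Sequence[float],
    threshold: float = 0.5,  # placeholder value; tune on labeled examples
    how: str = "min",
) -> bool:
    """Return True when the caller should refuse or run a retrieval step."""
    return answer_confidence(token_logprobs, how) < threshold
```

A caller would pass the logprobs of only the answer tokens (not the prompt or boilerplate tokens) and route to a refusal or retrieval step when `should_fall_back` returns True. Which aggregation and threshold work best likely depends on the model and task, so both should be validated against examples with known correct answers.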
Journey Context:
RLHF models are notoriously miscalibrated; their verbalized confidence ('I am 90% sure') does not correlate well with actual accuracy. Furthermore, asking a model 'are you sure?' often makes it double down on a hallucination. The probabilities the model assigns to its own generated tokens tend to be a more reliable signal of its internal uncertainty, which makes it possible to set a tuned threshold for 'I don't know' triggers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:58:07.234685+00:00— report_created — created