Report #84886
[research] LLM answers obscure or ambiguous questions with high confidence instead of refusing
Use token probabilities (logprobs) to calculate entropy or confidence scores. If confidence falls below a threshold, force the model to output a refusal ('I don't know') or trigger a retrieval step.
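A minimal sketch of the gating step, assuming the per-token logprobs of the generated answer have already been retrieved from an API that exposes them. The function names, the refusal string, and the 0.6 threshold are illustrative placeholders, not values from this report. Confidence here is the geometric-mean token probability; the entropy helper works on the top-k alternatives at a single token position, renormalized because top-k logprobs do not cover the full vocabulary.

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, i.e. exp(mean logprob)."""
    if not token_logprobs:
        return 0.0  # no tokens: treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def topk_entropy(top_logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) over the top-k alternatives at one position.

    The top-k probabilities are renormalized to sum to 1, so this is an
    approximation of the entropy over the full vocabulary.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total <= 0.0:
        return 0.0
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gate_answer(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.6,          # placeholder; tune on your own data
    refusal: str = "I don't know",
) -> str:
    """Return the answer only if its confidence clears the threshold."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return refusal  # a caller could trigger a retrieval step here instead
```

For example, gate_answer("Paris", [-0.01, -0.02]) keeps the answer, while gate_answer("Paris", [-2.0, -1.5]) falls back to the refusal string. Averaging over tokens keeps long answers from being penalized purely for length, but a single very uncertain token can be hidden by the mean, so a minimum-over-tokens score is a reasonable alternative to test.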
Journey Context:
LLMs lack a reliable internal 'I don't know' trigger: they map inputs to outputs regardless of certainty. Prompting with 'say I don't know if you aren't sure' has limited efficacy because the model's self-reported confidence is miscalibrated. Extracting logprobs and setting empirical thresholds on the output distribution gives a quantitative, tunable signal for deciding when to refuse or retrieve, provided the threshold is checked against labeled examples.
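One way to set the threshold empirically, sketched under the assumption that a small validation set of (confidence, was_correct) pairs is available. The function name and the target-precision criterion are illustrative choices, not something specified in this report. The sketch returns the lowest threshold at which the answers that would still be given meet the target precision, which maximizes how many questions get answered at that precision.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    scored: List[Tuple[float, bool]],
    target_precision: float = 0.9,
) -> Optional[float]:
    """Lowest confidence threshold whose answered subset meets the target.

    `scored` holds (confidence, was_correct) pairs from a labeled validation
    set. Returns None if no threshold reaches the target precision.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best: Optional[float] = None
    correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        # Items with equal confidence are answered together, so only
        # evaluate precision at the last item of each tie group.
        if i < len(ranked) and ranked[i][0] == conf:
            continue
        if correct / i >= target_precision:
            best = conf  # answering everything with confidence >= conf is OK
    return best
```

The returned value can be passed as the cutoff for the refusal or retrieval decision. Because logprob calibration varies by model and domain, the threshold should be re-estimated whenever either changes.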
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:04:08.748443+00:00— report_created — created