Report #37623
[research] LLM fails to say 'I don't know' and fabricates an answer for obscure queries
Sample several generations for the same query, group them by meaning, and use the logprobs of the generated tokens to estimate semantic entropy over those groups. If the sampled generations diverge significantly in meaning (high entropy), trigger a refusal path ('I don't know') rather than returning the highest-probability hallucination.
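A minimal sketch of this workaround follows, under stated assumptions. Each sample is assumed to arrive as a pair of generated text and per-token logprobs. The function names, the 0.5-nat threshold, and the string-normalization equivalence check are all hypothetical; a real setup would likely replace `naive_same_meaning` with a semantic check such as bidirectional entailment from an NLI model.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumed input shape: (generated text, per-token logprobs for that text).
Sample = Tuple[str, List[float]]


def naive_same_meaning(a: str, b: str) -> bool:
    """Hypothetical stand-in for a real semantic-equivalence check."""
    def norm(s: str) -> str:
        return " ".join(s.lower().strip(" .!?").split())
    return norm(a) == norm(b)


def cluster_by_meaning(
    samples: Sequence[Sample], same_meaning: Callable[[str, str], bool]
) -> List[List[Sample]]:
    """Greedy clustering: compare each sample to the first member of each cluster."""
    clusters: List[List[Sample]] = []
    for sample in samples:
        for cluster in clusters:
            if same_meaning(sample[0], cluster[0][0]):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def semantic_entropy(
    samples: Sequence[Sample],
    same_meaning: Callable[[str, str], bool] = naive_same_meaning,
) -> float:
    """Entropy (in nats) over meaning clusters, weighted by sequence probability."""
    clusters = cluster_by_meaning(samples, same_meaning)
    # Sequence logprob = sum of token logprobs; cluster mass = sum over its members.
    # Simplification: repeated identical samples are counted once per occurrence.
    log_mass = [logsumexp([sum(lp) for _, lp in c]) for c in clusters]
    total = logsumexp(log_mass)  # normalize over the sampled clusters only
    probs = [math.exp(lm - total) for lm in log_mass]
    return -sum(p * math.log(p) for p in probs if p > 0)


def answer_or_refuse(
    samples: Sequence[Sample],
    threshold: float = 0.5,  # hypothetical cutoff; tune on held-out data
    same_meaning: Callable[[str, str], bool] = naive_same_meaning,
) -> str:
    """Return the most likely sample, or a refusal if the meanings diverge."""
    if semantic_entropy(samples, same_meaning) > threshold:
        return "I don't know"
    return max(samples, key=lambda s: sum(s[1]))[0]
```

As a usage example, calling `answer_or_refuse` on three samples that all normalize to the same string returns the most likely of them, while three equally likely samples with three distinct meanings give an entropy of log 3 and take the refusal path. The threshold and the number of samples are tuning choices that this report does not specify.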
Journey Context:
Prompting the model to 'say I don't know if you aren't sure' is unreliable because the model often cannot distinguish a correct answer from a confident hallucination; both can yield high token probabilities, and raw softmax probabilities tend to be poorly calibrated. Semantic entropy (the diversity in meaning across multiple sampled answers) is an empirically supported signal, not a guarantee, for detecting when the model is operating in a regime of uncertainty.
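For reference, the quantity described above can be written out as follows. The notation is an assumption and does not come from this report: x is the query, s is a sampled generation with tokens s_t, and C is the set of meaning clusters found among the samples. In practice p(c | x) is estimated from a finite set of samples and renormalized over the clusters observed.

```latex
p(s \mid x) = \prod_{t} p(s_t \mid s_{<t}, x), \qquad
p(c \mid x) = \sum_{s \in c} p(s \mid x), \qquad
SE(x) = -\sum_{c \in C} p(c \mid x) \, \log p(c \mid x)
```

Two limiting cases show the behaviour: if every sample falls into a single cluster, SE(x) = 0 regardless of how the wording varies; if the probability mass is spread evenly over k clusters, SE(x) = log k.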
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:37:50.657654+00:00— report_created — created