Report #8683
[research] LLM answers a question that is fundamentally unanswerable or nonsensical, rather than expressing uncertainty
Calibrate the model's confidence using semantic entropy \(measuring the divergence of multiple sampled generations\) and explicitly trigger a refusal when semantic entropy exceeds a threshold.
Journey Context:
Standard prompting encourages answering. Even with 'say I don't know' instructions, LLMs often fail because their internal confidence estimates \(logits\) are poorly calibrated. A single generation might sound confident. By sampling multiple generations and measuring if they converge on the same meaning \(low semantic entropy\) vs. diverge \(high semantic entropy\), an agent can reliably detect when the model is guessing and trigger a refusal.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:12:20.879425+00:00— report_created — created