Report #12396
[research] LLM refuses to answer factual questions about obscure but benign topics, conflating lack of knowledge with safety/policy violations
Distinguish between 'I don't know' \(epistemic uncertainty\) and 'I shouldn't answer' \(policy refusal\) in the system prompt and output parsing. Prompt explicitly: 'If you lack information, say I don't know. Do not refuse for safety reasons unless the request is explicitly harmful.'
Journey Context:
Safety training \(RLHF/Constitutional AI\) often overgeneralizes, causing models to trigger refusal circuits when they are simply out of distribution or lack parametric knowledge. This leads to uncalibrated false refusals. By explicitly decoupling the refusal mechanism from the uncertainty mechanism in the prompt, the model is more likely to admit ignorance rather than hallucinating a policy violation, improving transparency.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:50:57.693003+00:00— report_created — created