Report #14085
[research] Prompting an LLM to "only answer if certain" causes extreme over-refusal: it says "I don't know" for basic facts it actually knows
Use selective prediction via confidence calibration (e.g., self-consistency or logprob thresholds) rather than absolute prompt-based constraints. Ask the model to generate multiple reasoning paths; if they converge on the same answer, return it; if they diverge, say "I don't know".
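The self-consistency workaround above could be sketched roughly as follows. This is a minimal illustration, not a verified implementation: `sample_fn` is a hypothetical callable standing in for whatever client samples one answer from the model at nonzero temperature, and `n_samples=5` and `min_agreement=0.6` are placeholder values that would need tuning.

```python
from collections import Counter
from typing import Callable


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def selective_answer(
    sample_fn: Callable[[str], str],  # hypothetical: returns one sampled final answer
    question: str,
    n_samples: int = 5,               # assumed default, tune per task
    min_agreement: float = 0.6,       # assumed default, tune per task
    refusal: str = "I don't know",
) -> str:
    """Answer only if enough independent samples agree; otherwise abstain."""
    raw = [sample_fn(question) for _ in range(n_samples)]
    keys = [normalize(a) for a in raw]
    top, count = Counter(keys).most_common(1)[0]
    if count / n_samples >= min_agreement:
        # Return the first raw sample that matches the majority answer.
        return raw[keys.index(top)]
    return refusal
```

Note that the prompt passed to `sample_fn` carries no "only answer if certain" instruction; abstention is decided outside the model, from the agreement rate alone. Exact string matching after normalization is a simplification and may undercount agreement for free-form answers.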
Journey Context:
Telling a model to say "I don't know" if unsure naively shifts the output distribution toward refusal, because models are poorly calibrated and overestimate their uncertainty when challenged. Self-consistency sampling provides a much better proxy for factual certainty, and because the abstention decision is made outside the prompt, it does not trigger the learned refusal circuits.
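The logprob-threshold variant mentioned in the workaround might look like the sketch below. It assumes the serving API can return per-token log-probabilities for the generated answer (not all APIs do); the function names and the `threshold=0.8` cutoff are illustrative assumptions and should be calibrated on held-out questions.

```python
import math
from typing import Sequence


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of the answer, in [0, 1]."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def gate_by_logprob(
    answer: str,
    token_logprobs: Sequence[float],  # assumed to cover the answer tokens only
    threshold: float = 0.8,           # assumed cutoff, calibrate before use
    refusal: str = "I don't know",
) -> str:
    """Return the answer if its confidence clears the threshold, else abstain."""
    if answer_confidence(token_logprobs) >= threshold:
        return answer
    return refusal
```

Averaging over tokens keeps long answers from being penalized merely for their length, but it can also hide a single very uncertain token; taking the minimum token probability instead is a stricter alternative.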
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:40:12.973699+00:00— report_created — created