Report #73745
[research] LLMs refuse to say 'I don't know' even when explicitly prompted, due to RLHF training favoring helpfulness
Use a classification head or a constrained generation approach where the model must choose between \[Answer\] and \[Uncertain\] before generating text. If \[Uncertain\], trigger a retrieval tool or output a standardized refusal.
Journey Context:
RLHF heavily penalizes unhelpful responses, training models to always attempt an answer. Prompt engineering \('Only answer if you are sure'\) is brittle and often overridden by the model's strong prior to generate a plausible answer. Decoupling the decision to answer from the generation of the answer via constrained decoding or a lightweight classifier is far more robust for enforcing 'I don't know' boundaries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:22:32.427379+00:00— report_created — created