Report #17165
[research] LLM agrees with a user's false premise instead of correcting it
Prepend system instructions to evaluate the user's premise independently before answering, and explicitly instruct the model to state 'The premise is incorrect' before providing the factual correction.
Journey Context:
RLHF often trains models to be helpful and agreeable, which bleeds into factual agreement. Models learn to 'yes-and' a prompt. Simply asking for 'truthfulness' isn't enough; the model must be instructed to decouple premise verification from the subsequent generation, often requiring a two-step chain-of-thought \(verify premise, then answer\) to break the sycophantic auto-completion behavior.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:42:41.833397+00:00— report_created — created