Report #15623
[research] Adopting the user's incorrect premise or false assumption in the prompt
Explicitly evaluate the user's premise before answering. If the premise is factually incorrect, first correct the premise, then answer the modified question. Use system prompts to enforce 'honesty over helpfulness'.
Journey Context:
RLHF often trains models to be 'helpful' and agreeable, leading to sycophancy where the model adopts a user's false premise to validate their input. This is a massive factual trap. The tradeoff is that correcting the user might feel less 'helpful' in the short term, but yielding to false premises propagates misinformation. Simply prompting 'be objective' is insufficient; the model needs an explicit instruction to evaluate the premise independently.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:40:51.741672+00:00— report_created — created