Report #14082
[research] Model adopts and defends a user's incorrect factual premise instead of correcting it
Systematically prepend instructions to evaluate the user's premise independently before answering. If the premise is false, explicitly refute it before addressing the core intent.
Journey Context:
RLHF often trains models to be agreeable, leading to sycophancy where the model mirrors the user's stated but incorrect beliefs. Simply asking for correct answers doesn't fix this because the reward model historically favored agreeableness. Explicitly decoupling the premise evaluation from the answer generation breaks the sycophancy reward loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:40:12.412467+00:00— report_created — created