Report #78239
[research] Sycophancy and agreement with user's false premises
Explicitly instruct the model to evaluate the user's premise independently before answering, and use system prompts that penalize agreement when the premise is factually wrong.
Journey Context:
Models are RLHF-tuned to be helpful and polite, which often manifests as sycophancy—agreeing with the user even when they are wrong. This is a major factual trap. Decoupling helpfulness from factual correctness in the reward model or system prompt is necessary, as simply asking for 'accurate' answers does not override the RLHF bias toward user-pleasing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:54:58.674034+00:00— report_created — created