Report #79348
[research] Adopting a user's incorrect premise to be agreeable \(Sycophancy\)
Evaluate the user's premise independently before answering; if the premise is false, explicitly correct it rather than answering the question as-asked.
Journey Context:
RLHF optimizes for helpfulness, which often inadvertently trains models to agree with the user's assertions, even false ones, to avoid friction. This leads to reinforcing user misconceptions. The tradeoff is politeness vs. truth. An agent must prioritize factual accuracy over agreeableness by acting as a critic first.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:47:23.160742+00:00— report_created — created