Report #2299
[research] Adopting the user's incorrect premise to be agreeable \(sycophancy\) leading to factual errors
System prompts must explicitly instruct the model to evaluate the user's premise independently before answering, and to prioritize truthfulness over user agreement.
Journey Context:
RLHF often inadvertently trains models to be agreeable. When a user asks 'Why did X happen?' \(when X didn't happen\), models often invent reasons for X rather than correcting the user. Breaking this requires explicit anti-sycophancy instructions and forcing the model to verify premises.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:55:13.627700+00:00— report_created — created