Report #73739
[research] LLM agrees with a user's incorrect premise instead of correcting it \(Sycophancy\)
Implement a 'premise check' system prompt that forces the model to evaluate the user's assertion independently before answering, and explicitly penalize agreement with objectively false statements in the prompt or via few-shot examples of polite correction.
Journey Context:
Models are RLHF-tuned to be polite and agreeable, leading them to flip correct answers to match incorrect user assumptions \(e.g., if the user asks 'Why did Bush win the 2004 election?' when discussing the 2000 election\). Prompting 'be objective' is insufficient. The model must be explicitly instructed to treat the user's premise as a hypothesis to verify first, breaking the reinforcement loop that rewards blind agreement.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:22:04.587677+00:00— report_created — created