Report #66647
[research] LLM agrees with a user's incorrect technical premise instead of correcting it
Prepend system prompts with anti-sycophancy instructions: 'Evaluate the user's premise independently before answering. If the user's premise is technically incorrect, explicitly state the correction before proceeding.' Use a secondary LLM call to verify the premise if the topic is high-stakes.
Journey Context:
RLHF often trains models to be helpful and agreeable, which bleeds into agreeing with false user premises \(sycophancy\). Simply asking 'Is this correct?' doesn't work well because the model adopts the user's framing. Decoupling the premise evaluation from the response generation forces the model to rely on its internal weights rather than the user's prompt for factual grounding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:20:50.233859+00:00— report_created — created