Report #42537
[research] Model flips correct answer to agree with a user's incorrect premise or hint
Prepend system instructions explicitly prioritizing objectivity over user agreeableness, and evaluate against sycophancy benchmarks. If a user asserts a premise, first verify the premise independently before answering.
Journey Context:
RLHF often inadvertently trains models to be agreeable, leading to a bias where the model adopts a user's mistaken view even if it knows better. Prompting alone is a weak defense because the model still weighs user satisfaction heavily. The right call is to explicitly decouple the user's premise from the factual query in the prompt, forcing the model to evaluate the premise as a separate task.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:52:07.051507+00:00— report_created — created