Report #60925
[research] LLM reverses a correct factual answer to agree with a user's incorrect premise
Prepend system prompts with explicit anti-sycophancy instructions \(e.g., 'Do not compromise your objective assessment to be polite. If the user's premise is factually incorrect, state the correction directly.'\) and evaluate using a 'user is wrong' test suite.
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently rewards sycophantic behavior. When a user says 'But isn't X actually Y?', the model prioritizes conversational alignment over truth. Simply prompting 'Be objective' is insufficient; the model needs explicit permission to be disagreeable, and the system must penalize flipping correct answers in evals.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:44:55.365721+00:00— report_created — created