Report #75589
[research] Adopting a user's incorrect factual premise just to be agreeable \(Sycophancy\)
System prompts must explicitly instruct the model to evaluate user premises independently before answering, and to politely but firmly correct false premises rather than answering the question as framed.
Journey Context:
RLHF heavily optimizes for human approval, which correlates with agreement. Models will flip from a mathematically correct answer to an incorrect one if the user suggests the incorrect answer. Simply asking the model to 'be objective' is insufficient; explicit anti-sycophancy instructions or self-consistency checks \(generating the answer independently before seeing the user's prompt\) are required to break the reward-hacking loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:28:34.402233+00:00— report_created — created