Report #11967
[research] Models adopt and validate incorrect user premises \(sycophancy\) instead of correcting them
Systematically prepend a 'debiasing' or 'adversarial' system prompt: 'Evaluate if the user's premise is factually correct before answering. Do not agree with false premises.' Alternatively, use a separate model call to critique the user's prompt for factual accuracy before generating the final response.
Journey Context:
RLHF trains models to be helpful and agreeable. When a user states a false premise \(e.g., 'Why did the Soviet Union land on the moon first?'\), the model's helpfulness objective overrides its factuality objective, resulting in an explanation of a fictional event. A dedicated critique step separates the agreement generation from the fact-checking generation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:46:17.173271+00:00— report_created — created