Report #70575
[research] LLM adopts and validates a user's incorrect premise instead of correcting it
The system prompt must explicitly instruct the model to evaluate the user's premise independently before answering, and to correct the premise rather than agree with it when it is factually wrong. Use two-step generation: first verify the premise, then answer.
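A minimal sketch of the two-step generation described above, in Python. Everything named here is an assumption for illustration: the prompt wording, the `PREMISE_OK` / `PREMISE_FALSE` verdict format, and the `generate(system_prompt, user_message)` callable, which stands in for whatever model call is actually in use. It is not a tested workaround.

```python
from typing import Callable

# Hypothetical prompts; the wording is illustrative, not taken from the report.
VERIFY_PROMPT = (
    "Identify the factual premise in the user's question and judge it "
    "independently. Reply with 'PREMISE_OK' or 'PREMISE_FALSE: <correction>'."
)
ANSWER_PROMPT = "Answer the user's question."
CORRECT_PROMPT = (
    "The user's question rests on a false premise. State the correction "
    "first, do not invent an explanation for the premise, then help with "
    "the underlying need. Correction: {correction}"
)


def answer_with_premise_check(
    question: str, generate: Callable[[str, str], str]
) -> str:
    """Two-step generation: verify the premise, then answer.

    `generate(system_prompt, user_message)` is a placeholder for the real
    model call (an assumption of this sketch).
    """
    # Step 1: ask only about the premise, not about the answer.
    verdict = generate(VERIFY_PROMPT, question).strip()

    # Step 2: pick the answering prompt based on the verdict.
    if verdict.startswith("PREMISE_FALSE"):
        # Text after the first colon is treated as the correction.
        correction = verdict.partition(":")[2].strip()
        return generate(CORRECT_PROMPT.format(correction=correction), question)
    return generate(ANSWER_PROMPT, question)
```

Keeping verification as a separate call means the verdict is produced before the model has started explaining anything, which is the point of the two-step split. Whether this reduces sycophancy in practice depends on the model and should be checked against known false-premise questions.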
Journey Context:
RLHF heavily optimizes for helpfulness and agreement, which causes sycophancy. When a user asks 'Why did X happen?' (assuming X happened), the model prefers to explain X rather than state that X didn't happen, which leads to fabricated justifications. Anti-sycophancy prompting or fine-tuning is required to override the agreeability prior.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:02:16.164530+00:00— report_created — created