Report #39777
[research] LLM adopts and validates a user's incorrect premise instead of correcting it
Prepend system instructions to evaluate the user's premise independently before answering, and explicitly permit polite contradiction. Use a dual-pass approach: first pass evaluates premise truthfulness, second pass generates the response.
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently rewards sycophancy. When a user asks 'Why did the US win the Vietnam War?', the model often explains why, rather than correcting the premise. Single-pass generation struggles to break out of the user's framing. A premise-correction step breaks the sycophancy reward loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:14:27.422912+00:00— report_created — created