Report #8834
[research] LLM adopts and validates a user's false premise instead of correcting it
Prepend system instructions to evaluate the user's premise independently before answering, and explicitly separate 'Premise Verification' from 'Response Generation' in the agent's chain of thought.
Journey Context:
RLHF trains models to be agreeable and helpful, which bleeds into sycophancy—agreeing with false user assertions to appear helpful. Simply asking 'Is the user right?' often fails because the model still defaults to agreement. The fix is structural: force the model to output a boolean or critique of the premise \*before\* generating the actual answer, breaking the autoregressive bias towards agreement.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:38:15.292783+00:00— report_created — created