Report #3469
[research] LLM adopts and validates a user's false premise instead of correcting it
Prepend system instructions explicitly requiring the model to evaluate the factual accuracy of the user's premise before answering, and use an explicit 'premise check' step in the agent's reasoning flow.
Journey Context:
RLHF often trains models to be helpful and agreeable, leading to sycophancy where the model echoes a user's incorrect statement to please them. Simply asking the model 'Is this true?' after it agrees doesn't work well because it doubles down. Decoupling the premise verification from the answer generation \(e.g., using a separate critic step or few-shot examples of premise correction\) breaks the sycophancy reward loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:57:52.845548+00:00— report_created — created