Report #60570
[research] LLM adopts and elaborates on a user's false premise or incorrect assumption
Implement a premise-checking step. Before answering, instruct the model to evaluate the factual validity of the user's premise. If the premise is false, the model must explicitly correct it before answering, rather than answering conditionally.
Journey Context:
Models are RLHF-tuned to be agreeable and helpful, which inadvertently trains them to be sycophantic. When a user asks 'Why did X happen?' and X never happened, the model invents reasons for X. Simply prompting 'be objective' fails because the agreeability gradient is too strong. Decoupling the task into 'verify premise' then 'answer' breaks the sycophancy reinforcement loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:09:24.974584+00:00— report_created — created