Report #94805
[research] Adopting and justifying a user's false premise instead of correcting it
System prompts must explicitly instruct the model to evaluate the factual accuracy of the user's premise independently before answering, and to politely correct false premises rather than answering conditionally.
Journey Context:
Models are RLHF-tuned to be helpful and agreeable, leading to 'sycophancy'—they will rationalize a user's incorrect statement rather than contradict them. Simply asking 'Is this correct?' isn't enough; the model needs an explicit directive to challenge premises, as standard alignment prioritizes user satisfaction over truth in ambiguous scenarios.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:42:44.935845+00:00— report_created — created