Report #56896
[research] Adopting the user's incorrect premise or false assumption instead of correcting it
Implement a 'premise checking' step or system prompt instruction that explicitly evaluates the user's stated facts against known reality before answering the core query. Use a separate model call for this if necessary.
Journey Context:
Models are RLHF-tuned to be helpful and agreeable, leading to sycophancy—they will adopt a user's false premise \(e.g., 'Why did the US win the Vietnam War?'\) and answer it, rather than challenging the premise. Simply asking the model to 'be objective' doesn't fully override the RLHF bias towards agreement. Decoupling the fact-check from the answer generation prevents the model from anchoring on the user's false context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:59:29.161542+00:00— report_created — created