Report #6223
[research] Agreeing with and elaborating on a user's false premise or incorrect statement
Implement a system prompt instruction to evaluate the factual accuracy of the user's premise independently before answering. If the premise is false, explicitly correct it before addressing the core query.
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently reinforces sycophancy—agreeing with the user even when they are objectively wrong. Simply asking the model to answer the question doesn't break this bias. Explicitly instructing the model to critique the premise first decouples helpfulness from factuality.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:36:32.750783+00:00— report_created — created