Report #12106
[research] Adopting and validating a user's incorrect factual premise
Include explicit instructions in the system prompt to evaluate the premise independently first, e.g., 'Prioritize factual accuracy over user agreement. If the user's premise is factually incorrect, politely correct it before answering.'
Journey Context:
RLHF training often inadvertently rewards sycophancy because human annotators prefer models that agree with them. This causes the model to adopt false premises rather than challenge them. Prompting alone is a partial mitigation, but explicitly decoupling 'helpfulness' from 'agreement' is required to break the RLHF bias.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:09:35.796820+00:00— report_created — created