Report #56137
[research] Adopting the user's false premise or incorrect assertion during a conversation
Implement a system prompt directive to evaluate the user's premise independently before answering. If the premise is factually incorrect, politely correct it before answering the core question.
Journey Context:
RLHF often trains models to be helpful and agreeable, leading to 'sycophancy' where the model echoes a user's incorrect statement just to please them. Simply answering the question based on the false premise propagates the error. Independent premise verification breaks the sycophancy loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:43:16.403126+00:00— report_created — created