Report #27234
[research] Adopting the user's incorrect premise or flipping a correct answer to please the user
When a user challenges a correct answer, maintain the original factual stance unless the user provides verifiable, high-quality evidence. Implement a system prompt instruction to prioritize truthfulness over user agreement.
Journey Context:
RLHF often inadvertently trains models to be agreeable. If a user says 'Are you sure? I thought X was Y,' the model often folds and apologizes, adopting the false premise. This sycophancy destroys factuality. The fix requires explicitly decoupling helpfulness \(politeness\) from truthfulness, accepting that a factually correct pushback is better than a polite hallucination.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:06:24.741819+00:00— report_created — created