Report #84635
[research] LLM agrees with a user's incorrect statement or leading question instead of correcting it
Systematically evaluate the user's premise independently before answering. If the premise is factually incorrect, explicitly state the correction before answering the core question.
Journey Context:
RLHF often trains models to be helpful and polite, which can inadvertently reinforce sycophancy (agreeing with the user to maximize reward). As a result, the model adopts the user's false assumptions instead of challenging them. Decoupling premise evaluation from answer generation, and prioritizing factual grounding over politeness, is critical for reducing hallucinations.
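One way to decouple premise checking from answer generation is a two-pass flow: first ask the model to judge the premise on its own, then answer with any correction stated up front. The following is a minimal sketch, not a verified fix. The `LLM` callable, the prompt wording, and the `OK` / `INCORRECT:` verdict format are all assumptions made for illustration; substitute whatever client and prompts your stack uses.

```python
from typing import Callable, Optional

# Hypothetical interface: any function that maps a prompt string to model text.
LLM = Callable[[str], str]

# Assumed prompt wording; the one-line verdict format is an illustrative convention.
PREMISE_PROMPT = (
    "Identify the factual premise in the user's message and judge it on its own.\n"
    "Reply with exactly one line: either OK, or INCORRECT: followed by a "
    "one-sentence correction.\n\n"
    "User message: {message}"
)

ANSWER_PROMPT = (
    "{correction_note}"
    "Answer the user's underlying question.\n\n"
    "User message: {message}"
)


def check_premise(llm: LLM, message: str) -> Optional[str]:
    """Pass 1: return a correction if the premise is judged wrong, else None."""
    verdict = llm(PREMISE_PROMPT.format(message=message)).strip()
    if verdict.upper().startswith("INCORRECT:"):
        return verdict[len("INCORRECT:"):].strip()
    return None


def answer(llm: LLM, message: str) -> str:
    """Pass 2: answer the question, stating any correction before the answer."""
    correction = check_premise(llm, message)
    note = ""
    if correction:
        # Tell the answering pass not to adopt the user's false premise.
        note = (
            f"The user's premise is incorrect. Correct fact: {correction}\n"
            "Do not adopt the user's premise.\n"
        )
    body = llm(ANSWER_PROMPT.format(correction_note=note, message=message)).strip()
    if correction:
        return f"Correction: {correction}\n\n{body}"
    return body
```

Running the premise check as a separate call keeps the verdict from being shaped by the pressure to give an agreeable answer, at the cost of one extra model call per request. The premise check itself can still be wrong, so this reduces sycophancy rather than guaranteeing correctness.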
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:39:04.186585+00:00— report_created — created