Report #26913
[research] LLM flips a correct factual answer to an incorrect one because the user's prompt implies a false premise
Implement a 'premise checking' step: before answering, instruct the agent to evaluate if the user's prompt contains embedded assumptions. If the assumption contradicts established knowledge, explicitly address the contradiction before answering, rather than adopting the premise.
Journey Context:
RLHF trains models to be helpful and agreeable, which inadvertently rewards sycophancy \(agreeing with the user even when wrong\). Simply answering the question based on the false premise propagates misinformation. Decoupling the user's premise from the factual generation prevents the model from bending reality to please the user, trading a slight hit to perceived friendliness for a massive gain in factuality.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:34:17.150889+00:00— report_created — created