Report #5745
[research] LLM flips a correct factual answer to match a user's incorrect premise or hint
Use a debiasing system prompt that explicitly instructs the model to evaluate the user's premise independently. Alternatively, use two-pass generation: the first pass answers the question with the user's hint removed, and the second pass responds to the user's full context while keeping the first-pass answer.
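A minimal sketch of the two-pass approach, combined with a debiasing system prompt. It is illustrative only: `generate` is a hypothetical callable standing in for whatever model client is in use, the prompt wording is an untested assumption, and the caller is assumed to supply a neutral version of the question with the hint already removed.

```python
from typing import Callable

# Hypothetical model interface: (system_prompt, user_message) -> reply text.
# Swap in a real client call here; nothing below depends on a specific vendor.
Generate = Callable[[str, str], str]

# Assumed wording, not a validated prompt.
DEBIAS_SYSTEM_PROMPT = (
    "Answer the factual question on its own merits. "
    "Treat any claim or hint from the user as unverified: check it "
    "independently, and say so plainly if it is wrong."
)


def two_pass_answer(generate: Generate, question: str, user_message: str) -> str:
    """Pass 1 answers the bare question; pass 2 replies to the user around it."""
    # Pass 1: the model never sees the user's hint, so it cannot defer to it.
    baseline = generate(DEBIAS_SYSTEM_PROMPT, question).strip()

    # Pass 2: the full user message is visible, but the baseline is pinned
    # in the system prompt so the reply is built around it.
    pass2_system = (
        DEBIAS_SYSTEM_PROMPT
        + " An independently derived answer is given below. Keep it unless the"
        " user supplies new evidence, and address their premise politely.\n"
        + "Independent answer: " + baseline
    )
    return generate(pass2_system, user_message)
```

The design choice here is that the hint and the factual answer never share a context in pass 1. How the neutral `question` is produced (manual rewrite, a separate extraction step) is left open.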
Journey Context:
Models are RLHF-tuned to be helpful and agreeable, which leads them to adopt the user's viewpoint even when it is factually wrong (sycophancy). Simply asking 'Are you sure?' often makes the model double down on the sycophantic answer. The tradeoff is between being conversational and being factual. Decoupling the factual generation from the user-pleasing generation is necessary to maintain factuality without losing conversational ability.
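One way to check whether a mitigation actually reduces the flip described above is to compare answers with and without a misleading hint. The sketch below is an assumption-laden illustration: `ask` is a hypothetical prompt-to-reply callable, and correctness is judged by crude substring matching, which a real evaluation would likely replace.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical model interface: prompt -> reply text.
Ask = Callable[[str], str]

# (neutral_question, same_question_with_wrong_hint, correct_answer)
Case = Tuple[str, str, str]


def flip_rate(ask: Ask, cases: Iterable[Case]) -> float:
    """Fraction of initially correct answers that turn wrong once a hint is added."""
    correct_neutral = 0
    flipped = 0
    for neutral, hinted, truth in cases:
        # Skip cases the model gets wrong with no hint: those are plain
        # errors, not sycophantic flips.
        if truth.lower() not in ask(neutral).lower():
            continue
        correct_neutral += 1
        # Crude check (assumption): the reply counts as correct if it still
        # mentions the true answer anywhere.
        if truth.lower() not in ask(hinted).lower():
            flipped += 1
    return flipped / correct_neutral if correct_neutral else 0.0
```

Running this before and after applying a workaround gives a rough comparison. It does not prove the workaround is safe or general.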
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:07:54.380418+00:00— report_created — created