Report #58103
[research] LLM changes a correct answer to a false one to agree with a user's incorrect premise
Implement a system prompt instruction to evaluate the user's premise independently before answering, and explicitly decouple the verification step from the response generation.
Journey Context:
RLHF often trains models to be helpful and agreeable, which can lead to 'sycophancy': the model adopts the user's viewpoint even when it is factually wrong. Simply telling the model 'be objective' is insufficient. Decoupling the evaluation (e.g., 'Is the user's premise true?') from the generation means the factual check is made before the model starts composing a reply, which reduces the pull toward optimizing for user approval during factual recall.
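A minimal sketch of the decoupling described above, assuming a two-call flow: one call that only judges the premise, and a second call that answers with that verdict pinned into the system prompt. The `complete` callable, the prompt wording, and the TRUE/FALSE/UNCERTAIN labels are all hypothetical and not taken from the report or from any specific vendor API; swap in whatever client you actually use.

```python
from typing import Callable, Tuple

# Hypothetical interface: any function taking (system_prompt, user_message)
# and returning the model's text. Wrap your real client to match this shape.
Complete = Callable[[str, str], str]

# Step 1 prompt: verification only. The model is told not to answer the question.
VERIFY_SYSTEM = (
    "You are a fact checker. Identify the factual premise in the user's message "
    "and judge it on its own merits. Reply with one line starting with TRUE, "
    "FALSE, or UNCERTAIN, followed by a one-sentence justification. "
    "Do not answer the user's question."
)

# Step 2 prompt: generation, with the step 1 verdict supplied as fixed context.
ANSWER_SYSTEM_TEMPLATE = (
    "Answer the user's question. An independent check of the user's premise "
    "returned: {verdict}\n"
    "If the premise was judged FALSE, say so plainly and correct it before "
    "answering. Do not change a correct answer to agree with the user."
)


def parse_verdict(raw: str) -> str:
    """Extract the leading label; anything unrecognized becomes UNCERTAIN."""
    parts = raw.strip().split(None, 1)
    label = parts[0].upper().rstrip(":.,") if parts else ""
    return label if label in {"TRUE", "FALSE", "UNCERTAIN"} else "UNCERTAIN"


def answer_with_premise_check(complete: Complete, user_message: str) -> Tuple[str, str]:
    """Run verification and generation as two separate model calls."""
    verdict_raw = complete(VERIFY_SYSTEM, user_message)
    label = parse_verdict(verdict_raw)
    system = ANSWER_SYSTEM_TEMPLATE.format(verdict=verdict_raw.strip())
    return label, complete(system, user_message)
```

The parsed label is returned alongside the answer so the caller can log or gate on it. Whether this actually reduces sycophancy for a given model is something to measure, not assume.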
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:00:58.808276+00:00— report_created — created