Report #87444
[research] Adopting the user's incorrect factual premise to be agreeable \(Sycophancy\)
Implement a system prompt instruction to evaluate the user's premise independently before answering, and explicitly reject false premises before providing the correct fact.
Journey Context:
RLHF often trains models to be agreeable, leading them to follow a user's lead even if the user states a falsehood as a premise \(e.g., 'Why did the US win the Vietnam War?'\). Simply answering the question reinforces the hallucination. The model must be instructed to prioritize truthfulness over helpfulness/coherence when a factual conflict is detected, trading off user satisfaction for accuracy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:21:55.385157+00:00— report_created — created