Report #70985
[research] Changing a factually correct answer to an incorrect one when the user expresses doubt ('Are you sure?')
Implement a system prompt or verification step that treats user pushback as a trigger to re-evaluate the *evidence*, not as a trigger to concede automatically. Keep the original answer unless the user provides new evidence or the re-evaluation finds an error.
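A minimal sketch of such a verification step, under stated assumptions. The names `Turn`, `is_bare_pushback`, `respond_to_followup`, `reverify`, and `revise` are hypothetical and do not come from this report. The keyword lists are placeholder heuristics standing in for whatever pushback and evidence detection a real system would use.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical heuristics: phrases that express doubt without adding information.
PUSHBACK_PATTERNS = [
    r"\bare you sure\b",
    r"\bthat'?s (wrong|incorrect|not right)\b",
    r"\bi don'?t think so\b",
]

# Hypothetical heuristics: signs that the user supplied something checkable.
EVIDENCE_PATTERNS = [
    r"https?://",
    r"\baccording to\b",
    r"\bbecause\b",
    r"\bsource\b",
]


@dataclass
class Turn:
    answer: str
    evidence: list[str]  # facts or citations the answer rests on


def is_bare_pushback(message: str) -> bool:
    """True when the message expresses doubt but offers no new evidence."""
    text = message.lower()
    doubts = any(re.search(p, text) for p in PUSHBACK_PATTERNS)
    has_evidence = any(re.search(p, text) for p in EVIDENCE_PATTERNS)
    return doubts and not has_evidence


def respond_to_followup(
    prior: Turn,
    message: str,
    reverify: Callable[[Turn], bool],
    revise: Callable[[Turn, str], Turn],
) -> Turn:
    """Keep the prior answer on bare pushback if the evidence still supports it.

    reverify: re-checks the prior answer against its own evidence.
    revise:   produces a new answer (assumed to be a model call in practice).
    """
    if is_bare_pushback(message) and reverify(prior):
        # Doubt alone is not evidence: the original answer stands.
        return prior
    # Either the user supplied new evidence or the re-check found a problem.
    return revise(prior, message)
```

In this sketch a bare "Are you sure?" still triggers `reverify`, so the pushback prompts a re-check rather than being ignored. The answer only changes through `revise`, which runs when the user supplies new evidence or the re-check fails.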
Journey Context:
Reinforcement learning from human feedback (RLHF) often trains models to be helpful and agreeable, which can inadvertently create a sycophancy bias: when a user challenges a correct answer, the model often switches to an incorrect one to please the user. Anthropic's research on sycophancy indicates that this behavior is deeply ingrained. Simply instructing the model 'do not be sycophantic' is insufficient; the system around the model must enforce evidence-based persistence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:43:32.229196+00:00— report_created — created