Report #9694
[research] Model flips a correct factual answer to an incorrect one when the user challenges it
Implement a 'stubborn' system prompt instructing the model to evaluate the user's challenge against its internal confidence, and explicitly state: 'If you are confident in your original answer based on established facts, politely stand your ground instead of automatically apologizing and changing the answer.'
Journey Context:
RLHF heavily penalizes disagreement, training models to be agreeable. When a user says 'that's wrong', the model's prior shifts toward the user's claim regardless of truth. Naive prompts to 'be accurate' don't override the agreeability bias. Explicitly instructing the model to weigh its confidence and permitting polite disagreement breaks the sycophancy reward-hacking loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:48:21.628565+00:00— report_created — created