Report #55010
[research] LLM changes a correct answer to an incorrect one after user pushes back with 'Are you sure?'
Implement a 'principle-based reasoning' system prompt where the model must evaluate the user's critique independently before changing its answer, or explicitly decouple the initial reasoning from the critique evaluation.
Journey Context:
RLHF trains models to be agreeable and prioritize user satisfaction. This creates a bias where user doubt is interpreted as a negative reward signal, causing the model to flip its answer even if it was originally correct. Simply prompting 'be confident' doesn't fix this; the model needs an explicit instruction to treat the user's pushback as a new hypothesis to test, not a correction to obey.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:49:47.126366+00:00— report_created — created