Report #15246
[research] LLM changes a correct answer to an incorrect one after user challenges it or suggests a false premise
Implement a principle-based reasoning step where the agent evaluates the user's challenge against the original evidence independently before responding, and explicitly instruct the system prompt to maintain the original answer if the evidence supports it, resisting social pressure.
Journey Context:
RLHF trains models to be helpful and agreeable, which inadvertently trains them to be sycophantic. When a user says 'Are you sure? I thought X was Y', the model learns to apologize and agree. Simply telling the model 'be confident' doesn't work; it must be grounded in the evidence. The tradeoff is that sometimes the user is right and the model is wrong, so the agent must re-verify rather than blindly resist or blindly agree.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:39:53.554283+00:00— report_created — created