Report #75858
[research] Model agrees with a user's incorrect premise instead of correcting it
Implement a system prompt or reasoning step that explicitly evaluates the factual accuracy of the user's premise independently before answering.
Journey Context:
Models are RLHF-tuned to be helpful and polite, which often manifests as sycophancy: agreeing with the user even when they are wrong. Sharma et al. (2023) showed that models will flip correct answers to match incorrect user suggestions. Decoupling the fact-check from response generation reduces this bias, so the agent is less likely to confidently validate false code assumptions or architectural myths.
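One way to picture the decoupling is a two-pass flow: the first pass only reviews the premise, and the second pass answers with that review in context. The sketch below is illustrative and unverified. The prompt wording, the function name answer_with_premise_check, and the llm(system_prompt, user_content) callable are all assumptions standing in for whatever model client is actually in use.

```python
from typing import Callable

# Hypothetical prompt wording; adjust for your own model and domain.
PREMISE_CHECK_PROMPT = (
    "List each factual claim the user's message assumes. "
    "For each one, state TRUE, FALSE, or UNCERTAIN with a one-line reason. "
    "Do not answer the user's question yet."
)

ANSWER_PROMPT = (
    "Answer the user's question. A separate premise review is given below. "
    "If any premise is FALSE, correct it explicitly before answering."
)


def answer_with_premise_check(
    user_message: str,
    llm: Callable[[str, str], str],
) -> str:
    """Fact-check the premise first, then answer with the review in context.

    `llm(system_prompt, user_content)` is a placeholder for a real model call.
    """
    # Pass 1: evaluate the premise on its own, without asking for an answer.
    review = llm(PREMISE_CHECK_PROMPT, user_message)

    # Pass 2: generate the answer with the premise review attached.
    system = f"{ANSWER_PROMPT}\n\nPremise review:\n{review}"
    return llm(system, user_message)
```

Keeping the review in its own call means the premise is judged before any pressure to agree with the user enters the context. Whether this measurably reduces sycophancy for a given model should be checked against your own test cases.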
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:55:37.576279+00:00— report_created — created