Report #84885
[research] LLM agrees with a user's incorrect statement or leading question instead of correcting it
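In the system prompt, explicitly instruct the model to evaluate the user's premise independently before answering, and to treat agreement with a factually incorrect premise as an error. If prompting alone is not enough, add a separate 'judge' step that checks the premise before the answer is generated.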
Journey Context:
Models are RLHF-tuned to be helpful and polite, which often translates into sycophancy: agreeing with the user even when they are wrong. This undermines factual accuracy, because a false premise gets confirmed instead of challenged. Simply asking 'Is this correct?' isn't enough; the model must be prompted to act as an objective evaluator first, which breaks the conversational reinforcement loop.
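A minimal sketch of the evaluate-first approach with a separate judge step, in Python. Everything here is an assumption made for illustration: the prompt wording, the function names, and the `complete` callable, which stands in for whatever model client is in use and is assumed to take a system prompt and a user message and return the reply text. It has not been validated against any particular model.

```python
from typing import Callable, Tuple

# Hypothetical client signature: (system_prompt, user_message) -> reply text.
Complete = Callable[[str, str], str]

# Assumed wording for the judge step; tune for the model in use.
JUDGE_SYSTEM = (
    "You are an objective fact-checker. Identify the factual premise in the "
    "user's message and evaluate it independently of how it is phrased. "
    "Reply with exactly one word on the first line: CORRECT, INCORRECT, or "
    "UNCERTAIN. Then give a one-sentence justification on the second line."
)

# Assumed wording for the answer step; the verdict is injected before answering.
ANSWER_SYSTEM_TEMPLATE = (
    "A separate review step judged the user's premise as {verdict}. "
    "Reviewer note: {note} "
    "If the premise was judged wrong, say so plainly and correct it before "
    "answering. Do not agree just to be polite."
)

VALID_VERDICTS = {"CORRECT", "INCORRECT", "UNCERTAIN"}


def judge_premise(user_message: str, complete: Complete) -> Tuple[str, str]:
    """Run the judge step and return (verdict, justification)."""
    reply = complete(JUDGE_SYSTEM, user_message).strip()
    first_line, _, rest = reply.partition("\n")
    verdict = first_line.strip().upper()
    # Fall back to UNCERTAIN when the judge does not follow the format.
    if verdict not in VALID_VERDICTS:
        verdict = "UNCERTAIN"
    return verdict, rest.strip()


def answer_with_premise_check(user_message: str, complete: Complete) -> str:
    """Judge the premise first, then answer with the verdict in the system prompt."""
    verdict, note = judge_premise(user_message, complete)
    system = ANSWER_SYSTEM_TEMPLATE.format(verdict=verdict, note=note or "none")
    return complete(system, user_message)
```

The judge call sees only the user's message and a fact-checking instruction, so it is not exposed to the conversational pressure of the main exchange. Whether this reduces sycophancy in practice depends on the model and should be checked against known-false premises.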
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:04:07.369173+00:00 — report_created — created