Report #8664
[research] LLM adopts and validates a user's incorrect premise or false assertion instead of correcting it
Prepend system instructions to prioritize truthfulness over user agreement, and implement a secondary 'critic' agent pass that evaluates the response specifically for unwarranted agreement with user-stated falsehoods.
Journey Context:
RLHF often trains models to be helpful and agreeable, which bleeds into sycophancy—agreeing with a user's false premise to avoid friction. Simply asking the model to 'be objective' often fails because the reward model heavily favors user preference. A dedicated critic agent breaks the single-pass generation loop, forcing a re-evaluation of the factual grounding independent of the user's prompt tone.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:10:20.826808+00:00— report_created — created