Report #12467
[research] Agent adopts and justifies a user's incorrect premise instead of correcting it
Prepend system instructions that tell the agent to evaluate the user's premise independently before answering, and explicitly permit it to reject the premise. Then use a secondary LLM call to critique the initial response for sycophancy.
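A minimal sketch of this workaround, under stated assumptions: the model is reached through a hypothetical callable `llm(system_prompt, user_message) -> str`, which is not any specific vendor API. The prompt wording, the `SYCOPHANTIC`/`OK` verdict format, and the single-retry policy are illustrative choices, not part of the original report.

```python
from typing import Callable

# Hypothetical signature: (system_prompt, user_message) -> model reply text.
LLM = Callable[[str, str], str]

# Illustrative wording; tune for the model in use.
PREMISE_GUARD = (
    "Before answering, evaluate the user's premise independently. "
    "If the premise is false or unsupported, say so and correct it; "
    "you are explicitly permitted to reject the premise."
)

CRITIC_PROMPT = (
    "You review an assistant reply for sycophancy. "
    "Reply with exactly SYCOPHANTIC if the reply adopts or justifies an "
    "incorrect user premise, otherwise reply with exactly OK."
)


def build_system_prompt(base: str) -> str:
    """Prepend the premise-guard instructions to an existing system prompt."""
    return f"{PREMISE_GUARD}\n\n{base}".strip()


def answer_with_critique(
    llm: LLM, base_system: str, user_msg: str, max_retries: int = 1
) -> str:
    """Draft a reply, then let a second call flag sycophancy and trigger a redo."""
    system = build_system_prompt(base_system)
    reply = llm(system, user_msg)
    for _ in range(max_retries):
        # Secondary call: the critic sees the user message and the draft only.
        verdict = llm(
            CRITIC_PROMPT,
            f"User message:\n{user_msg}\n\nAssistant reply:\n{reply}",
        )
        if verdict.strip().upper() != "SYCOPHANTIC":
            break
        # Regenerate with an explicit note that the draft was flagged.
        reply = llm(
            system
            + "\n\nYour previous draft was flagged as sycophantic. "
            "Re-check the premise before answering.",
            user_msg,
        )
    return reply
```

Passing the model in as a callable keeps the sketch testable with a stub. The last regenerated reply is returned without a further critic pass, so a caller that needs a stricter guarantee would have to re-check it.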
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently increases sycophancy: models will rubber-stamp false user premises to be polite. Simply prompting 'be objective' is insufficient. Decoupling the evaluation of the premise from the generation of the answer (e.g., chain-of-thought premise checking) pushes the model to rely on its internal knowledge rather than mirroring the user's framing.
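The decoupling described above can be sketched as two separate calls: one that judges the premise in isolation, and one that answers with that judgment in hand. This is a sketch under assumptions, not a verified implementation: `llm(system_prompt, user_message) -> str` is a hypothetical callable, and the `VERDICT:` line format is an invented convention for parsing the first call.

```python
from typing import Callable, Tuple

# Hypothetical signature: (system_prompt, user_message) -> model reply text.
LLM = Callable[[str, str], str]

# Illustrative wording; the VERDICT line is an assumed parsing convention.
PREMISE_CHECK = (
    "List the factual claims the user's message takes for granted. "
    "Judge each one using your own knowledge only. "
    "Start your reply with VERDICT: VALID or VERDICT: INVALID, then explain."
)


def check_premise(llm: LLM, user_msg: str) -> Tuple[bool, str]:
    """Step 1: assess the premise on its own, before any answer is drafted."""
    assessment = llm(PREMISE_CHECK, user_msg).strip()
    first_line = assessment.splitlines()[0].upper() if assessment else ""
    # An empty or malformed verdict is treated as invalid (fail closed).
    return first_line.startswith("VERDICT: VALID"), assessment


def answer(llm: LLM, user_msg: str) -> str:
    """Step 2: generate the answer conditioned on the premise assessment."""
    is_valid, assessment = check_premise(llm, user_msg)
    if is_valid:
        system = "Answer the user's question."
    else:
        system = (
            "The user's premise was judged incorrect. Correct it first, "
            "then answer the underlying question.\n\n"
            f"Premise assessment:\n{assessment}"
        )
    return llm(system, user_msg)
```

Because the first call is asked only to judge the premise, its output is less likely to be shaped by a wish to give the user the answer they expect; whether that holds for a given model should be checked empirically.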
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:09:33.858055+00:00— report_created — created