Report #4896
[research] LLM agrees with incorrect user premises instead of correcting them
Prepend system instructions explicitly directing the model to evaluate the user's premise independently before answering, and use a secondary LLM call (a critic) to verify the logic before returning the final answer.
Journey Context:
RLHF often trains models to be 'helpful' and agreeable, which inadvertently creates sycophancy. If a user asks 'Why does my code fail because of X?', the model will often assume X is the cause even when the real bug is Y. This is disastrous for debugging. The tradeoff is that correcting the user too aggressively feels pedantic, while accepting a false premise leads to wild goose chases. A critic step or an explicit 'evaluate the premise' instruction breaks the sycophancy loop.
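A minimal sketch of the two-step approach described above, written in Python. It is not tied to any provider: `LLM` is a hypothetical callable of the form `(system_prompt, user_message) -> reply text` that you would wrap around whatever client you use. The prompt wording, the `PASS`/`FAIL` verdict convention, and the `max_revisions` parameter are all illustrative assumptions, not part of the original report.

```python
from typing import Callable

# Hypothetical interface: (system_prompt, user_message) -> model reply text.
# Wrap your actual client in a function with this shape.
LLM = Callable[[str, str], str]

# Assumed wording; tune for your model.
PREMISE_SYSTEM = (
    "Before answering, evaluate the user's premise independently. "
    "If the stated cause is wrong or unsupported, say so and explain what "
    "the evidence points to instead. Do not assume the user's diagnosis is correct."
)

# Assumed verdict convention: the critic's reply starts with PASS or FAIL.
CRITIC_SYSTEM = (
    "You are a critic. Given a question and a draft answer, check whether the "
    "draft accepted an unverified premise from the question or contains a logic "
    "error. Reply with 'PASS' if the draft is sound, otherwise 'FAIL' followed "
    "by the problem."
)


def answer_with_critic(question: str, llm: LLM, max_revisions: int = 1) -> str:
    """Draft an answer, then let a critic call veto it up to max_revisions times."""
    draft = llm(PREMISE_SYSTEM, question)
    for _ in range(max_revisions):
        verdict = llm(
            CRITIC_SYSTEM,
            f"Question:\n{question}\n\nDraft answer:\n{draft}",
        )
        if verdict.strip().upper().startswith("PASS"):
            break
        # Feed the critic's objection back and ask for a revised answer.
        draft = llm(
            PREMISE_SYSTEM,
            f"{question}\n\nA reviewer flagged this problem with the previous "
            f"draft:\n{verdict}\n\nRevise the answer.",
        )
    # Note: the draft produced by the last revision is returned without a
    # further critic pass.
    return draft
```

Each revision costs two extra model calls, so a small `max_revisions` keeps latency bounded. Whether the critic should be a different model from the answerer is left open here.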
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:15:45.756389+00:00— report_created — created