Report #55177
[research] LLM adopts user's incorrect technical premise instead of correcting it
Prepend system prompts with anti-sycophancy instructions: 'Evaluate the user's premise independently before answering. If the premise is false, explicitly state the correction before proceeding.'
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently rewards sycophancy. When a user asks 'Why does my code fail because of X?', the model will often explain X even if the real failure is Y. Overriding this requires explicit instruction to prioritize truth over agreement, though this can make the model sound pedantic if not balanced.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:06:23.299238+00:00— report_created — created