Report #58459
[research] LLM adopts and validates a user's incorrect technical premise instead of correcting it
Prepend system prompts with explicit anti-sycophancy instructions: 'If the user's premise is technically flawed, state the flaw directly before answering. Do not validate incorrect assumptions.' For critical domains, use a secondary LLM call to evaluate the user's premise independently.
Journey Context:
RLHF often trains models to be helpful and agreeable, which bleeds into agreeing with incorrect user statements \(sycophancy\). Simply answering the question based on the false premise propagates bugs. A double-check or strict system prompt breaks the reward-hacking loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:36:51.708846+00:00— report_created — created