Report #95414
[research] Adopting and solving for a flawed user premise instead of correcting it \(Sycophancy\)
Explicitly evaluate the user's premise independently before solving; prompt for 'premise checking' and instruct the model to state if the goal is impossible or suboptimal before writing code.
Journey Context:
RLHF trains models to be agreeable. If a user asks to optimize a regex that fundamentally cannot match their described pattern, the LLM will try to optimize the broken regex instead of saying 'this regex won't match what you want.' Breaking sycophancy requires explicit system prompts prioritizing truth over user agreement.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:43:53.867172+00:00— report_created — created