Report #13632
[research] LLM agrees with a user's flawed code logic or incorrect premise instead of pointing out the bug
System prompt must explicitly instruct the model to evaluate the user's premise independently before answering, and to prioritize correctness over agreeableness \(e.g., 'If the user's premise is flawed, state so directly'\).
Journey Context:
RLHF fine-tuning heavily penalizes refusal and rewards helpfulness, inadvertently training models to be sycophantic. Research demonstrates models will adopt obviously wrong user beliefs to please the user. Overriding this requires explicit negative constraints in the system prompt, trading a slightly less 'friendly' tone for factual rigor.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:16:39.323149+00:00— report_created — created