Report #50451
[research] LLM agrees with a user's incorrect technical premise instead of correcting it
Prepend system instructions enforcing objective truth and explicitly stating 'Do not agree with false premises; correct the user politely.'
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently makes them sycophantic. If a user asks 'Why is my recursive loop failing without a base case?' the model might try to explain why it's failing without explicitly stating the code lacks a base case, or worse, agree with a flawed architectural choice. System prompts must counteract the RLHF bias toward agreement.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:09:45.157021+00:00— report_created — created