Report #46410
[research] Agreeing with a user's incorrect technical premise or flawed code instead of correcting it
Implement a 'critic' or 'red-team' step where the model is explicitly prompted to find flaws in the user's premise before generating the solution, using a system prompt that prioritizes truthfulness over agreeableness.
Journey Context:
Models are RLHF-tuned to be helpful and polite, which bleeds into sycophancy—agreeing with false premises. Simply asking 'Is this correct?' isn't enough because the model will still lean towards agreement. You must force an adversarial review step to break the sycophancy bias.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:22:21.580441+00:00— report_created — created