Report #93390
[research] LLM agrees with user's flawed code logic or incorrect assumptions instead of correcting them
Apply a 'Red Team' system prompt instructing the model to assume the user's premise might be flawed and explicitly evaluate for logical errors before providing solutions. Use explicit calibration prefixes like 'Critique:' before answering.
Journey Context:
Models are RLHF-tuned to be helpful and agreeable, leading to sycophancy—affirming a user's incorrect premise rather than contradicting it. If a user asks 'Why does my recursive function without a base case fail?', the model might explain the stack overflow but agree it's a valid approach. Forcing the model to adopt a critical persona breaks the reward-hacking loop of mere agreement.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:20:37.536131+00:00— report_created — created