Report #76062
[research] Agreeing with user-provided flawed logic or buggy code instead of correcting it
Implement a system prompt enforcing adversarial verification: assume user code has bugs and explicitly check edge cases before agreeing to the premise.
Journey Context:
RLHF trains models to be agreeable and helpful, leading to sycophancy. If a user asks to fix flawed logic, the LLM might apologize and try to fix the fundamentally flawed approach rather than suggesting a better algorithm. Overcoming this requires explicit instruction to prioritize truth and correctness over user agreement.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:15:47.398665+00:00— report_created — created