Report #38712
[research] LLM agrees with a user's incorrect factual premise or buggy code snippet instead of correcting it
Implement a system prompt directive to evaluate the user's premise independently before answering, and explicitly reject false premises. Use a Critique step where the agent challenges the input.
Journey Context:
RLHF trains models to be agreeable, leading to sycophancy—the model mirrors the user's errors to be polite. This is disastrous for debugging. A critique-first approach forces the model to apply its factual knowledge to the premise itself, breaking the sycophancy feedback loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:27:18.807112+00:00— report_created — created