Report #38816
[counterintuitive] Why does asking the model to 'review your answer' or 'check for mistakes' fail to catch errors and sometimes introduce new ones?
Use external verification for all non-trivial validation: execute the code and check its output, run unit tests, run linters, or compare against ground truth. Do not rely on the model rereading its own output and producing a self-assessment. If you must use self-correction, give the model external feedback signals (error messages, test failures, diff output) rather than asking it to evaluate itself.
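As a rough illustration of "execute code and check outputs", the sketch below runs a candidate snippet together with its tests in a fresh interpreter and returns the real pass/fail status plus the captured output. The helper name `run_external_check`, the `median` example, and the 10-second timeout are assumptions made for this example, not part of the report. Model-generated code is untrusted, so in practice this would likely run inside a sandbox or container.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_check(candidate_source: str, test_source: str, timeout: float = 10.0):
    """Run candidate code plus its tests in a separate Python process.

    Returns (passed, feedback). The feedback is the process's actual
    stdout/stderr, i.e. information the model did not generate itself.
    (Hypothetical helper for illustration.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_check.py"
        script.write_text(candidate_source + "\n\n" + test_source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout} seconds"
    passed = proc.returncode == 0
    return passed, (proc.stdout + proc.stderr).strip()


# A plausible-looking but wrong model output: it ignores even-length inputs.
buggy = "def median(xs):\n    xs = sorted(xs)\n    return xs[len(xs) // 2]\n"
tests = "assert median([1, 3]) == 2, 'even-length input should average the middle pair'\n"

passed, feedback = run_external_check(buggy, tests)
print("passed:", passed)
print(feedback)
```

The failing assertion and its traceback come from the interpreter rather than from the model, so they are the kind of signal worth feeding back into a retry prompt.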
Journey Context:
The appealing intuition: if the model made an error, asking it to double-check should catch it, just as self-review does for humans. This fails because LLMs have no separate verification mechanism. "Checking your work" is just more autoregressive token generation, conditioned on the same weights and on the existing (potentially wrong) answer already in context. The prior output acts as a strong attractor: the model tends to rationalize and defend its existing answer rather than re-derive it independently. Research shows that without external feedback, self-correction either keeps the same wrong answer or, alarmingly, changes correct answers to wrong ones at comparable rates. The model does not "know" its answer is wrong; it generates tokens that sound like verification. Only external ground truth (compiler errors, test results, execution output) breaks this loop, because it introduces information that is not derived from the model's own generation. The alternative of asking the model to solve the problem a different way and compare the results is slightly better, but it is still unreliable without an external arbiter.
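The loop described above can be sketched as follows: the retry prompt carries the verifier's output instead of a request to "double-check", and the loop returns nothing rather than falling back to self-assessment when no candidate passes. Everything here is an assumption for illustration: `generate` stands in for whatever model call you use, `stub_model` is a fake that only changes its answer when it sees feedback, and `verify_by_execution` is a toy in-process arbiter.

```python
import traceback
from typing import Callable, Optional, Tuple

Verifier = Callable[[str], Tuple[bool, str]]


def repair_with_feedback(
    generate: Callable[[str], str],  # hypothetical model call: prompt -> candidate
    verify: Verifier,                # external arbiter: candidate -> (passed, feedback)
    task: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Retry generation, conditioning each retry on external feedback only."""
    prompt = task
    for _ in range(max_rounds):
        candidate = generate(prompt)
        passed, feedback = verify(candidate)
        if passed:
            return candidate
        # Feed back what the verifier observed, not "please review your answer".
        prompt = (
            f"{task}\n\nPrevious attempt:\n{candidate}\n\n"
            f"External check failed with:\n{feedback}"
        )
    return None  # never verified: report failure instead of trusting the model


def verify_by_execution(candidate: str) -> Tuple[bool, str]:
    """Toy arbiter: execute the candidate and test it against known answers."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)
        assert namespace["is_even"](4) is True
        assert namespace["is_even"](7) is False
    except Exception:
        return False, traceback.format_exc(limit=1)
    return True, ""


def stub_model(prompt: str) -> str:
    """Fake model: wrong on the first try, corrected once it sees feedback."""
    if "External check failed" in prompt:
        return "def is_even(n):\n    return n % 2 == 0\n"
    return "def is_even(n):\n    return n % 2 == 1\n"


result = repair_with_feedback(stub_model, verify_by_execution, "Write is_even(n).")
print(result)
```

Returning `None` when no candidate passes is a deliberate choice in this sketch: an unverified answer is surfaced as a failure rather than accepted on the strength of the model's own confidence.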
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:37:26.733484+00:00— report_created — created