Report #22708
[counterintuitive] LLMs can reliably self-correct their own errors by reviewing their output
Provide external verification for self-correction: test results, linter output, execution traces, or human feedback. Do not rely on the model to catch its own mistakes by simply asking 'are you sure?' or 'review your answer'. Self-correction without external signal is largely performative — the model tends to stand by its original answer or make superficial changes.
Journey Context:
Huang et al. (2023) demonstrated that LLMs cannot effectively self-correct their reasoning without external feedback. When asked to verify their own outputs, models tend to either confirm their original (potentially wrong) answer or make cosmetic changes without fixing the underlying error. This is especially dangerous in coding agents that might 'self-review' generated code: the model will often approve its own buggy code because it lacks an independent verification mechanism. Real self-correction requires grounding in external signals: does the code compile? Do the tests pass? Does the output match expected results? The model's own judgment about its output quality is not a reliable signal. Intrinsic self-correction works only when the model's initial answer was close to correct and the verification prompt triggers a different reasoning path, but you cannot distinguish this case from the failure case without external validation.
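The grounding described above can be sketched as a small retry loop in which only external checks decide whether an answer is accepted. This is a minimal illustration, not a reference implementation: `run_external_checks`, `correct_with_feedback`, `Verdict`, and the `generate(task, feedback)` callable are hypothetical names introduced here, and `generate` stands in for whatever model call you use. It assumes the candidate is Python source and that the tests are plain `assert` statements appended to it. Running the candidate in a subprocess is not a sandbox; untrusted code needs real isolation.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    evidence: str  # compiler or test output, handed back to the model on failure


def run_external_checks(code: str, tests: str, timeout: float = 10.0) -> Verdict:
    """Judge candidate code using signals that do not depend on the model's opinion."""
    # Signal 1: does the code compile?
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError as exc:
        return Verdict(False, f"SyntaxError: {exc}")

    # Signal 2: do the tests pass? Run them in a separate interpreter process.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_with_tests.py"
        script.write_text(code + "\n\n" + tests, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return Verdict(False, f"timed out after {timeout} seconds")
    return Verdict(proc.returncode == 0, proc.stderr.strip() or proc.stdout.strip())


def correct_with_feedback(
    generate: Callable[[str, Optional[str]], str],  # hypothetical model wrapper
    task: str,
    tests: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Verdict]]:
    """Regenerate until the external checks pass; the model never grades itself."""
    history: List[Verdict] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code = generate(task, feedback)
        verdict = run_external_checks(code, tests)
        history.append(verdict)
        if verdict.passed:
            return code, history
        feedback = verdict.evidence  # the external signal drives the next attempt
    return None, history  # no verified answer within the round budget
```

The loop returns `None` rather than the last attempt when no candidate passes, so an unverified answer is never silently accepted. The failure text passed back as `feedback` is the concrete traceback or compiler message, which gives the model something specific to act on instead of a bare 'are you sure?'.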
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:31:14.028255+00:00— report_created — created