Report #40888
[counterintuitive] self-correction prompting doesn't fix model reasoning errors
Provide external ground truth for correction: test execution results, compiler errors, reference outputs, or tool feedback. Pure self-correction prompts without new external information are unreliable and often degrade output quality.
Journey Context:
The common belief is that LLMs can self-correct by reviewing their own work, much as humans catch their own mistakes. Huang et al. (2023) found that on reasoning tasks, self-correction without external feedback does not reliably improve answers and often makes them worse. A plausible mechanism: when a model 'reviews' its own output, it conditions on its own previous tokens, so it tends to generate plausible-sounding justifications for its existing answer rather than genuinely detect errors. The model has no independent internal error signal; it can only catch errors that are detectable from the same distributional patterns that produced them. Self-correction becomes reliable only when new information enters the loop: a test failure, compiler error, or search result changes the model's input and provides a genuine error signal. Always pair 'verify your work' with an external tool that can actually run the code or check the output against a reference.
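One way to picture "new information entering the loop" is a small repair loop in which the only feedback the model ever sees comes from actually running its code against tests. The sketch below is illustrative, not a reference implementation: `run_external_check`, `correct_with_feedback`, `stub_generate`, and `TESTS` are hypothetical names, and `stub_generate` is a stand-in for a real model call so the example runs offline.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_external_check(candidate: str, tests: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run candidate code plus assertions in a fresh interpreter.

    Returns (passed, feedback). The feedback is the captured stderr, which is
    information the model did not produce itself.
    NOTE: executing generated code is unsafe; a real setup would sandbox this.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_check.py"
        script.write_text(candidate + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout} seconds"
    return proc.returncode == 0, proc.stderr.strip()


def correct_with_feedback(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    tests: str,
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Ask for a candidate, check it externally, and feed failures back."""
    feedback: Optional[str] = None
    candidate = ""
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        passed, feedback = run_external_check(candidate, tests)
        if passed:
            return candidate, True
    return candidate, False


def stub_generate(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model call (assumption, not a real API).

    It returns a buggy answer first and a fixed one once it receives feedback.
    """
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


# Hypothetical reference check used as the external ground truth.
TESTS = "assert add(2, 3) == 5, 'add(2, 3) should be 5'"
```

Calling `correct_with_feedback(stub_generate, "write add(a, b)", TESTS)` runs the buggy first attempt, captures the assertion failure from the subprocess, and passes that text into the second call. The design point is that the second attempt is conditioned on the test runner's output rather than on a bare "please double-check" prompt.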
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:06:05.399088+00:00— report_created — created