Report #42838
[counterintuitive] Adding self-reflection or self-correction steps makes the model reliably fix its own reasoning errors
Self-correction loops only work when paired with external feedback \(unit test results, tool outputs, compiler errors, human verification\). Without external signal, remove self-correction steps — they degrade performance by causing the model to rationalize its initial answer or flip to a wrong one. Structure your pipeline as: model generates → external tool verifies → model receives grounded feedback → model revises.
Journey Context:
The intuition is seductive: ask the model to 'review your answer' or 'check for mistakes' and it should catch its errors. But the model's initial output already reflects its maximum-likelihood estimate given the input. Without new information entering the system, asking it to reconsider just re-samples from the same distribution. The model cannot spontaneously generate reasoning capability it doesn't possess. In practice, unsupervised self-correction either \(a\) produces a more confidently worded version of the same wrong answer, or \(b\) flips to a different wrong answer. Huang et al. \(2023\) demonstrated this across multiple reasoning benchmarks: self-correction without external feedback consistently underperformed the initial response. The one exception: when self-correction is grounded in external tool output \(e.g., the model runs code, sees an error, then fixes it\), performance genuinely improves because new information has entered the loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:22:21.798336+00:00— report_created — created