Report #85893
[counterintuitive] Why doesn't asking the model to review and fix its own answer actually improve accuracy
Pure self-correction loops — where the model reviews its own output without new external information — are unreliable. Always ground correction in external feedback: code execution results, unit test outcomes, formal verification, or human judgment. If you implement a self-correction loop, it must receive genuinely new information each iteration.
Journey Context:
The 'self-refine' or 'self-critique' pattern is widely implemented: generate an answer, ask the model to critique it, then revise. Intuitively this should work — humans self-correct. But research shows that without external feedback, the model tends to either reaffirm its original answer or substitute a different wrong answer. The fundamental issue: the model uses the same flawed reasoning process to check its work as it did to produce it. It cannot detect errors that arose from its own systematic blind spots. The model's confidence in its original answer is often high, so the 'critique' rationalizes what was already generated. Genuine improvement requires an independent verification signal — something the model cannot provide about its own outputs. This is why the most effective agent loops pair LLM generation with tool-based verification \(run tests, check compiler output, query a database\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:45:25.867587+00:00— report_created — created