Report #40079
[counterintuitive] Tell the model to self-correct and it will find its own errors
Only use self-correction loops when the model receives new external information between iterations (test results, tool output, compiler errors, search results). Pure self-correction, where the model is asked to review its own answer without any new information, is unreliable and often wastes tokens.
Journey Context:
The widespread practice is to append 'review your answer and fix any errors' to a prompt, or to run multi-turn self-correction loops, expecting the model to converge on correctness. Research demonstrates this is largely ineffective for reasoning tasks. The core problem: if the model could reliably recognize its answer as wrong, it would likely have generated the correct answer in the first place, because its 'most likely' output and its 'assessment of correctness' come from the same distribution. Without new information, 'self-correction' is just re-sampling, which may change the answer but not reliably toward correctness. The model may become more confident in wrong answers (confidence inflation) or simply rephrase the same error. Self-correction is effective, however, when the correction step introduces genuinely new information: running code and seeing a traceback, querying a database, getting search results, or receiving human feedback. The key distinction: self-correction requires new evidence, not just more computation on the same evidence.
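As a minimal sketch of the distinction drawn above, the loop below only asks for a retry after it has gathered new evidence, here a traceback from actually running the candidate code. Everything in it is illustrative: `generate` stands in for whatever model call you use (no real API is assumed), `fake_model` is a hypothetical stub so the sketch runs offline, and `run_candidate` uses a bare `exec`, which a real system would replace with a sandboxed runner.

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical signature: (task, feedback_from_last_run) -> candidate source code.
Generate = Callable[[str, Optional[str]], str]


def run_candidate(source: str) -> Optional[str]:
    """Run candidate code; return None on success, else the traceback text.

    Assumption: exec() is used only to keep the sketch short. Untrusted
    model output should be run in a sandbox or subprocess instead.
    """
    try:
        exec(compile(source, "<candidate>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc()


def correct_with_feedback(
    generate: Generate, task: str, max_rounds: int = 3
) -> Tuple[Optional[str], int]:
    """Retry only when the previous run produced new external evidence."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        feedback = run_candidate(candidate)  # new information enters here
        if feedback is None:
            return candidate, round_no
    # Out of rounds: report failure rather than asking for a blind re-review.
    return None, max_rounds


def fake_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model: fixes its bug once it sees a traceback."""
    if feedback is None:
        return "result = 1 / 0"
    return "result = 1 / 1"


if __name__ == "__main__":
    code, rounds = correct_with_feedback(fake_model, "compute a ratio")
    print(code, rounds)
```

The design point is that the retry prompt is never just 'review your answer': each iteration passes along the output of `run_candidate`, so the second attempt is conditioned on evidence the first attempt did not have.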
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:44:42.199613+00:00— report_created — created