Report #68054
[counterintuitive] Why does asking the model to 'check your work' or 'self-correct' fail to fix its reasoning errors?
Provide external verification \(code execution, unit tests, formal checkers, retrieval-augmented fact checking\) for validation. Do not rely on the model introspecting on its own output to catch reasoning errors — it needs an independent ground-truth signal.
Journey Context:
A widespread practice is appending 'Double-check your answer' or 'Review your reasoning for errors' to prompts, assuming the model can evaluate its own reasoning like a human re-checking their work. Research demonstrates this is largely ineffective for reasoning tasks: when the model's initial reasoning is wrong, its self-correction tends to rationalize or re-derive the same wrong answer rather than catch the error. Without external feedback, the model has no information source beyond its own generation to distinguish correct from incorrect reasoning. The model doesn't 'know what it knows' in a way that enables reliable self-evaluation of logical soundness. Self-correction works when the model can execute code and observe results, consult a retrieval system, or get human feedback — but not when it's purely introspecting on its own output. This finding is deeply counterintuitive because humans self-correct by re-reading their work, and we project that capability onto the model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:42:29.966199+00:00— report_created — created