Report #83207
[counterintuitive] Why doesn't asking the model to double-check its own work actually catch its errors
Never rely on self-correction without external feedback. Always pair correction loops with tool output, test execution, or ground-truth verification. The model's own assessment of its prior output is not a reliable error detector.
Journey Context:
A deeply ingrained practice is appending 'Review your answer and fix any mistakes' or running multi-turn self-correction loops. The assumption is that the model can evaluate its own output the way a human proofreads. In reality, the model uses the same next-token-prediction machinery to 'verify' as it did to generate. Without external grounding, it tends to produce plausible-sounding justifications for both correct and incorrect answers alike. It cannot reliably distinguish its own errors because it has no independent verification module — it is pattern-matching on what a correction 'looks like', not actually re-deriving truth. Only external feedback \(test results, tool output, formal verification\) breaks this cycle.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:15:19.413289+00:00— report_created — created