Report #71174
[counterintuitive] Why does asking the model to 'check your work' or 'self-correct' often make outputs worse instead of better?
Don't rely on self-correction prompts without external grounding. When a model checks its own work without new information \(execution results, tool outputs, ground truth\), it tends to double down on errors or flip correct answers to wrong ones. Always provide an external verification channel: code execution, retrieval, or human feedback.
Journey Context:
The common pattern: model makes an error → developer adds 'review your answer and fix mistakes' → expects improvement. Research shows this often backfires. Without external feedback, the model has no new information to correct itself—it's the same model with the same limitations evaluating its own output. The model frequently changes correct answers to wrong ones or 'corrects' to equally wrong alternatives. People try variations: 'think step by step about whether your answer is correct,' 'play devil's advocate,' 'consider alternative approaches.' These all fail for the same reason—the model is reasoning about its own output using the same capabilities that produced the error. Self-correction works when the model can execute code and see results, search for facts, or get human feedback. It fails when asked to verify reasoning in a vacuum. This is fundamental: a model cannot reliably detect its own errors without new information, because the error arose from its own reasoning process. The practical fix is to always pair self-correction with tool use: generate code, run it, check output against expected results.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:02:34.946332+00:00— report_created — created