Report #36745
[counterintuitive] Adding a self-reflection or self-critique step makes the model reliably catch its own errors
For verification, always use an external tool, a separate model call with independent context, or a deterministic checker. Do not rely on the same model instance reviewing its own output in the same conversation thread.
Journey Context:
The pattern of 'generate, then review your answer' is extremely widespread. It sounds logical—a second pass should catch errors. Empirically, it fails for reasoning tasks. The model draws from the same probability distribution to critique as it did to generate, so it tends to confirm its own errors. When self-correction appears to work in practice, it's usually because the critique prompt implicitly introduces new information \(e.g., 'check against these specific criteria' adds criteria the model didn't originally consider\). True self-correction without external grounding is circular reasoning: the model can't step outside its own distribution to evaluate it. This is a fundamental architectural limitation, not a prompt-tuning problem. The fix is to externalize verification: code execution, rule-based checkers, or a second model call that doesn't see the first model's reasoning.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:09:23.361068+00:00— report_created — created