Report #57509
[counterintuitive] The model can verify and fix its own errors by reviewing its output in a follow-up step or self-reflection loop
Always provide external grounding for verification: run the code, execute the query, check against a reference solution, or use a separate evaluation tool. Self-correction loops without external feedback are unreliable and can make outputs worse, not better.
Journey Context:
The prevailing practice in agent frameworks is to ask models to 'double-check your work' or 'review your answer for errors,' assuming this mirrors human self-correction. Research demonstrates that without external feedback (test results, compiler errors, reference answers), models cannot reliably distinguish their correct outputs from incorrect ones. When a model produces a wrong answer, asking it to verify often leads to the same wrong answer with more confident justification, or to 'corrections' that introduce new errors while fixing the original. The model's internal representation of its own output doesn't contain a separate verification channel; it is the same model evaluating its own probabilities. If the model's weights produced an error, those same weights are unlikely to flag it. True self-correction requires an external signal: a test suite, a compiler, a database lookup, or human feedback. Intrinsic self-correction for reasoning tasks is largely illusory, and agent architectures that rely on it without external tool use are building on sand.
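One way to picture the external-signal requirement described above is a retry loop in which the only thing that can accept or reject a candidate is a test run, never the model's own opinion of its output. The sketch below is illustrative only: `fake_model`, `run_external_check`, and `grounded_loop` are hypothetical names invented for this example, and the stub generator stands in for a real model call. It also assumes the candidate code is trusted enough to `exec` in-process; a real agent would run it in a sandbox or subprocess.

```python
from typing import Callable, Optional

# Hypothetical signature for a model call: takes the task plus feedback from
# the last failed external check (or None) and returns candidate source code.
Generator = Callable[[str, Optional[str]], str]


def run_external_check(source: str, tests: list, func_name: str) -> Optional[str]:
    """Execute the candidate against tests. Return None on success, else an error message."""
    namespace: dict = {}
    try:
        # Assumption: in-process exec is acceptable here; sandbox untrusted code.
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            actual = func(*args)
            if actual != expected:
                return f"{func_name}{args} returned {actual!r}, expected {expected!r}"
    except Exception as exc:  # syntax errors, missing function, runtime errors
        return f"{type(exc).__name__}: {exc}"
    return None


def grounded_loop(generate: Generator, task: str, tests: list,
                  func_name: str, max_attempts: int = 3):
    """Retry generation, feeding back only externally observed failures."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        feedback = run_external_check(candidate, tests, func_name)
        if feedback is None:  # the tests, not the model, decide acceptance
            return candidate, attempt
    raise RuntimeError(f"no candidate passed in {max_attempts} attempts: {feedback}")


def fake_model(task: str, feedback: Optional[str]) -> str:
    """Stub standing in for a model: the first draft has an off-by-one bug,
    and the answer changes only when concrete test feedback arrives."""
    if feedback is None:
        return "def add(a, b):\n    return a + b + 1\n"
    return "def add(a, b):\n    return a + b\n"


TESTS = [((1, 2), 3), ((0, 0), 0)]
code, attempts = grounded_loop(fake_model, "write add(a, b)", TESTS, "add")
print(attempts)  # 2: the first draft failed the test run, the second passed
```

The design choice worth noting is that `feedback` is always the literal failure message from the check, so any revision is driven by evidence the model did not produce itself. Replacing `run_external_check` with a prompt such as "review your answer" would remove exactly that signal.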
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:01:00.457688+00:00— report_created — created