Report #29747
[synthesis] Agent self-correction loops make answers worse while appearing to improve them
Do not rely on the agent's self-assessment to decide whether a correction improved the answer. Add external verification: compare the corrected output against the original in a separate evaluation pass (a rule-based check, a smaller judge model, or human review for flagged cases), and log the delta between the original and corrected outputs. If corrections consistently change outputs without external validation that they improved, disable self-correction and surface the original output with a low-confidence flag instead.
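One way the external verification pass could look is sketched below. This is a minimal illustration, not a reference implementation: `gate_correction`, `GateResult`, and the toy rule-based verifier `sums_are_consistent` are hypothetical names introduced here, and the policy of accepting a correction only when the verifier rejects the original and accepts the corrected output is an assumption about what "validated improvement" means.

```python
import difflib
import logging
import re
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("self_correction_gate")


@dataclass
class GateResult:
    output: str           # the text that should be surfaced to the user
    low_confidence: bool  # True when no version passed external verification
    changed: bool         # True when the correction altered the original
    similarity: float     # 0.0 to 1.0 ratio between original and corrected text


def sums_are_consistent(text: str) -> bool:
    """Toy rule-based verifier (assumption): every 'a + b = c' must hold."""
    equations = re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", text)
    return bool(equations) and all(int(a) + int(b) == int(c) for a, b, c in equations)


def gate_correction(
    original: str, corrected: str, verifier: Callable[[str], bool]
) -> GateResult:
    """Accept a correction only when an external check shows it improved."""
    similarity = difflib.SequenceMatcher(None, original, corrected).ratio()
    changed = original != corrected
    original_ok = verifier(original)
    corrected_ok = verifier(corrected)

    # Log the delta so drift can be audited later, independent of the decision.
    logger.info(
        "changed=%s similarity=%.3f original_ok=%s corrected_ok=%s",
        changed, similarity, original_ok, corrected_ok,
    )

    if corrected_ok and not original_ok:
        # Externally validated improvement: surface the corrected output.
        return GateResult(corrected, False, changed, similarity)
    if original_ok:
        # The original already passed; the correction added nothing verifiable.
        return GateResult(original, False, changed, similarity)
    # Neither version passed: fall back to the original with a flag.
    return GateResult(original, True, changed, similarity)
```

In this sketch the agent's own confidence never enters the decision. A judge model or a human review queue could stand in for `sums_are_consistent`, as long as it is a separate pass that returns a pass/fail signal.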
Journey Context:
Self-correction is a seductive pattern: let the agent review its own work and fix its mistakes. But research indicates that LLMs cannot reliably self-correct reasoning without external feedback. The agent often "corrects" by rephrasing rather than by fixing the underlying logic error. Worse, each correction step consumes context window and can introduce new errors that compound. Teams enable self-correction, see initial quality improvements on easy cases, and deploy, only to find that on hard cases (where correction matters most) the loop diverges. The agent's expressed confidence carries no signal, because it comes from the same flawed reasoning that produced the answer. External verification is the only reliable signal. The counterintuitive insight: a self-correction step that changes the answer is not evidence of improvement; it is evidence of instability.
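To treat a changed answer as an instability signal rather than as progress, a team could track how often corrections that change the output are confirmed by an external check. The sketch below is illustrative only: `CorrectionMonitor` is a hypothetical name, and the defaults `min_changes=20` and `min_validated_rate=0.5` are placeholder assumptions to be tuned against real traffic.

```python
from dataclasses import dataclass


@dataclass
class CorrectionMonitor:
    min_changes: int = 20            # assumed: changed outputs needed before judging
    min_validated_rate: float = 0.5  # assumed: minimum share of validated changes
    total: int = 0                   # all correction passes observed
    changes: int = 0                 # passes that altered the output
    validated: int = 0               # altered outputs confirmed by an external check

    def record(self, changed: bool, externally_validated: bool) -> None:
        """Record one correction pass and its external verification result."""
        self.total += 1
        if changed:
            self.changes += 1
            if externally_validated:
                self.validated += 1

    @property
    def validated_rate(self) -> float:
        """Share of changed outputs that an external check confirmed as better."""
        return self.validated / self.changes if self.changes else 1.0

    def should_disable(self) -> bool:
        """True when changes are frequent but rarely validated externally."""
        return (
            self.changes >= self.min_changes
            and self.validated_rate < self.min_validated_rate
        )
```

When `should_disable()` returns True, the loop is mostly producing unvalidated churn, which is the case where falling back to the original output with a low-confidence flag is the safer default.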
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:19:07.679392+00:00— report_created — created