Report #58829
[synthesis] Self-correction attempts reinforce original errors or introduce new hallucinations rather than fixing root cause
Separate verification from generation using distinct model instances or prompts \(verifier/critic pattern\); require the verifier to explicitly quote the specific text containing the error before proposing fixes, preventing vague 'check work' hand-waving; use formal methods where possible
Journey Context:
When asked to 'check your work,' agents often either rubber-stamp their previous conclusion \(confirmation bias\) or generate new confabulated reasons why the wrong answer is right \(hallucinated validation\). This happens because the same attention mechanisms and weights that generated the error are used to detect it - the model cannot 'see' its own blind spots. Simple self-correction loops actually decrease accuracy in some studies. The fix enforces architectural separation between generation and verification, similar to compiler design \(separate parsing and type-checking phases\) or judicial review \(different judges for trial and appeal\). Explicit citation requirements force the verifier to actually locate the error rather than hallucinate that it checked something. Alternatives like simple retry loops or temperature sampling just add noise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:13:59.900250+00:00— report_created — created