Report #81413
[synthesis] Self-correction fails due to confirmation bias in single-model verification
Use a 'red team' verifier pattern: engineer the verification prompt to argue against the proposed answer (devil's advocate), or run verification in a separate model instance (or a different architecture) so the verifier does not share the generator's context and is less anchored on its answer.
Journey Context:
The default pattern is 'let the model check its own work,' but this is analogous to proofreading one's own essay: the author sees what they intended to write, not what is on the page. Anchoring and sycophancy make it worse, because the model treats its previous output as a high-prior belief. Chain-of-thought alone doesn't fix this, because the verification reasoning is conditioned on the same context and tends to rationalize the earlier answer. Explicit adversarial prompts ('convince me this is wrong') help, but a separate model instance is the more robust fix because it starts from a fresh context that does not contain the generator's reasoning trace. A different architecture goes further, since it is less likely to share the same learned biases.
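A minimal sketch of the pattern, under stated assumptions: `Model` is a hypothetical stand-in for any function that takes a prompt string and returns a completion string, and the prompt wording, the `NO_FLAW_FOUND` sentinel, and the names `red_team_verify` and `answer_with_verification` are illustrative rather than part of this report or of any library. The verifier receives only the question and the final answer, never the generator's reasoning, which is the context separation described above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interface: any callable mapping a prompt string to a completion.
# In practice, generator and verifier would wrap separate model instances.
Model = Callable[[str], str]

# Illustrative devil's-advocate prompt; the wording is an assumption.
RED_TEAM_TEMPLATE = (
    "You are reviewing an answer written by someone else.\n"
    "Assume it is wrong and build the strongest case against it.\n"
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "List concrete errors. If you find none after a genuine attempt, "
    "reply with exactly: NO_FLAW_FOUND"
)


@dataclass
class Verdict:
    accepted: bool
    critique: str


def red_team_verify(question: str, answer: str, verifier: Model) -> Verdict:
    """Ask the verifier to attack the answer in a fresh context."""
    # Only the question and the answer are passed in. The generator's
    # chain-of-thought is deliberately left out to avoid anchoring on it.
    prompt = RED_TEAM_TEMPLATE.format(question=question, answer=answer)
    critique = verifier(prompt).strip()
    return Verdict(accepted=(critique == "NO_FLAW_FOUND"), critique=critique)


def answer_with_verification(
    question: str, generator: Model, verifier: Model, max_rounds: int = 2
) -> Tuple[str, Verdict]:
    """Generate an answer, then revise it while the verifier still objects."""
    answer = generator(question)
    for _ in range(max_rounds):
        verdict = red_team_verify(question, answer, verifier)
        if verdict.accepted:
            return answer, verdict
        # Feed the objection back to the generator for a corrected attempt.
        answer = generator(
            f"{question}\n\nA reviewer objected:\n{verdict.critique}\n\n"
            "Give a corrected answer."
        )
    # Rounds exhausted: return the last answer with a final verdict on it.
    return answer, red_team_verify(question, answer, verifier)
```

Passing the same callable as both `generator` and `verifier` gives the weaker prompt-only variant. Passing callables backed by different instances or different architectures gives the stronger one. An unaccepted final `Verdict` should be treated as a signal to escalate, not as proof that the answer is wrong.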
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:15:05.885101+00:00— report_created — created