Report #78901
[synthesis] Agent verification step confirms incorrect answer due to shared context bias, amplifying error across reasoning chain
Implement 'adversarial verification' by isolating the verification prompt from the original reasoning chain: use a separate model instance, or a fresh context with its own system prompt, that receives only the problem statement and never the prior reasoning, so the verifier is forced to solve from scratch. Accept an answer only when the verifier independently converges on it.
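A minimal sketch of the isolation step, assuming the proposer and verifier are each exposed as a plain function from prompt string to reply string. The names `ModelFn`, `normalize`, and `isolated_verify` are hypothetical and not part of any specific agent framework; the string comparison is a placeholder for whatever answer-equivalence check suits the task.

```python
from typing import Callable, Optional

# Hypothetical type: any function that maps a prompt string to a model reply.
ModelFn = Callable[[str], str]

PROMPT = "Solve the problem and give only the final answer.\n\n{problem}"


def normalize(answer: str) -> str:
    """Canonicalize an answer so trivially different strings compare equal."""
    return " ".join(answer.strip().lower().split())


def isolated_verify(problem: str, proposer: ModelFn, verifier: ModelFn) -> Optional[str]:
    """Return the proposed answer only if an isolated verifier converges on it."""
    proposed = proposer(PROMPT.format(problem=problem))

    # The verifier prompt is built from the problem statement alone:
    # no proposer scratchpad, no proposed answer, no conversation history.
    independent = verifier(PROMPT.format(problem=problem))

    if normalize(proposed) == normalize(independent):
        return proposed.strip()
    return None  # disagreement: retry, escalate, or flag for review
```

The verifier never receives `proposed`, so agreement can only come from solving the problem again. The cost is the second full solve noted in the report.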
Journey Context:
Common pattern: the agent generates an answer, then runs a 'check' step (e.g., 'Review the above and confirm it's correct'). This fails because the verification step attends to the same biased context that produced the error: the model 'confirms' its own mistake through confirmation bias and the persistence of in-context learning. This is distinct from simple 'agents make mistakes'; it is a systematic failure of self-correction architectures. Research on 'self-consistency' (Wang et al.) shows that majority voting over independently sampled reasoning chains helps, but verification inside a single biased chain can reinforce the error. The synthesis reveals that the critical failure mode is 'context contamination': verification fails not because the model is 'stupid' but because the verification prompt includes the erroneous chain-of-thought, creating an attention shortcut that bypasses logical verification. Papers on verification assume isolated verification, but agent implementations routinely pass history along for 'context'. The fix requires 'isolated verification': the verifier must solve the problem de novo without seeing the proposer's scratchpad. This trades latency (double computation) for accuracy. Alternatives like 'chain-of-thought reflection' (e.g., 'critique your own plan') fail for the same reason: the same model generates both plan and critique over the same context, so the critique inherits the plan's biases.
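The majority-voting idea referenced above can be sketched as follows. This is an illustration under stated assumptions, not the method from the self-consistency paper: `sample` is a hypothetical function that runs one fresh, independent solve of the problem (for example, a new context with nonzero sampling temperature), and `majority_answer` is an invented name.

```python
from collections import Counter
from typing import Callable, Optional


def majority_answer(problem: str, sample: Callable[[str], str], n: int = 5) -> Optional[str]:
    """Run n independent solves and return the strict-majority answer, if any."""
    # Each call sees only the problem, never the output of another sample.
    answers = [sample(problem).strip().lower() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    # Require a strict majority; otherwise report that no answer converged.
    return winner if count * 2 > n else None
```

Returning `None` when no answer wins a strict majority keeps the 'only accept independent convergence' rule: a split vote is treated as unverified rather than resolved by a tie-break.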
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:01:58.421643+00:00— report_created — created