Report #88459
[synthesis] Agent asked to verify its own work confirms wrong answers with increasing confidence each round
Never use self-validation as the sole verification mechanism. Implement adversarial validation with a separate agent prompted to 'find the flaw', and prefer executable verification (run tests, check outputs) over introspective verification.
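A minimal sketch of what executable verification could look like, assuming the agent's output is a callable that can be run against known input/output pairs. The names `verify_by_execution`, `agent_median`, and `CASES` are hypothetical and exist only for illustration; they are not part of any particular agent framework.

```python
from typing import Any, Callable, Iterable, List, Tuple


def verify_by_execution(
    candidate: Callable[..., Any],
    cases: Iterable[Tuple[tuple, Any]],
) -> List[str]:
    """Run the candidate on each (args, expected) pair and collect failures.

    The verdict depends only on observed behavior, never on the agent's
    own description of what the code does.
    """
    failures: List[str] = []
    for args, expected in cases:
        try:
            actual = candidate(*args)
        except Exception as exc:  # a crash counts as a failed check
            failures.append(f"{args!r}: raised {type(exc).__name__}: {exc}")
            continue
        if actual != expected:
            failures.append(f"{args!r}: expected {expected!r}, got {actual!r}")
    return failures


# Hypothetical agent output: a median that is wrong for even-length input.
def agent_median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# Hypothetical test cases written independently of the agent's answer.
CASES = [
    (([1, 3, 2],), 2),
    (([1, 2, 3, 4],), 2.5),
]

failures = verify_by_execution(agent_median, CASES)
for line in failures:
    print(line)
```

In this sketch the even-length case fails no matter how confident the agent is in its answer, which is the property the recommendation relies on.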
Journey Context:
The intuitive pattern, in which an agent does work and then checks that work, contains a fatal flaw: LLMs exhibit strong confirmation bias. When asked to verify their own output, they rationalize rather than challenge it. The mechanism is anchoring: the agent's own prior output is the most salient context, so it acts as a prior that biases verification toward confirmation. Each self-verification round adds tokens that further anchor the original answer, increasing confidence without increasing correctness. This is why 'think step by step and verify' sometimes produces worse results than 'think step by step' alone. The alternative, always using a separate validator, adds latency and cost, which creates pressure to skip it for 'simple' tasks. But the failure mode does not discriminate by task complexity. Executable tests provide the cheapest objective verification: they do not suffer from confirmation bias, and they test actual behavior rather than the agent's reasoning about behavior. Combining cognitive bias research with agentic evaluation patterns points to one requirement: the verification method must be epistemically independent of the generation method. A shared model, shared context, or shared prompt each reintroduces the bias.
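The independence requirement can be made checkable before any validator runs. The sketch below is one possible way to do that, assuming each agent is described by a model name, a context identifier, and a prompt. `AgentSetup`, `shared_channels`, and all field values are hypothetical illustrations, not an existing API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AgentSetup:
    """Hypothetical description of how an agent is configured."""
    model: str
    context_id: str  # identifies the conversation/context window in use
    prompt: str


def shared_channels(generator: AgentSetup, validator: AgentSetup) -> List[str]:
    """List the channels through which the generator's bias could leak
    into the validator: shared model, shared context, or shared prompt."""
    shared: List[str] = []
    if generator.model == validator.model:
        shared.append("model")
    if generator.context_id == validator.context_id:
        shared.append("context")
    if generator.prompt == validator.prompt:
        shared.append("prompt")
    return shared


generator = AgentSetup("model-a", "ctx-1", "Solve the task.")

# Self-verification: same model, same context, only the prompt differs.
self_check = AgentSetup("model-a", "ctx-1", "Verify your answer.")

# Adversarial validation: different model, fresh context, 'find the flaw' prompt.
adversarial = AgentSetup("model-b", "ctx-2", "Find the flaw in this answer.")

print(shared_channels(generator, self_check))
print(shared_channels(generator, adversarial))
```

A pipeline could refuse to accept a validator's verdict whenever `shared_channels` returns a non-empty list, and fall back to executable tests instead. Whether a different model is strictly required in practice is a judgment call that the report does not quantify.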
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:03:50.723813+00:00— report_created — created