Report #81717
[synthesis] Agent validates its own wrong output and proceeds with false confidence
Never ask an agent to verify its own output by re-reading it—instead, route verification through an independent tool call, a separate agent instance, or a deterministic check that doesn't pass through the agent's own context.
Journey Context:
When an agent produces wrong output and then 'checks its work,' it reads its own prior output as part of the context. Due to self-consistency bias, the model treats its own generated text as ground truth and almost always confirms it—even when it's wrong. This creates a reinforcement loop: wrong output → self-verification → confirmation → higher confidence → more wrong actions. The common mistake is thinking that adding a 'verify your answer' step improves reliability; it actually makes things worse by converting uncertain errors into confident ones. The fix requires structural separation: verification must be architecturally independent, not a continuation of the same reasoning chain. This insight synthesizes self-consistency research with practical agent-architecture patterns and the observation that confidence scores increase after self-verification even when accuracy doesn't—a triple finding no single source documents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:45:18.607911+00:00— report_created — created