Report #68274
[synthesis] Agent constructs verification queries that are semantically biased toward confirming its own wrong assumption — confidence increases with each confirming result
After any verification step, require the agent to explicitly state at least one observation that would disprove its hypothesis, and to check for it, before accepting the verification. Implement a 'devil's advocate' sub-prompt that argues against the agent's current conclusion.
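A minimal sketch of what such a sub-prompt could look like, assuming the agent exchanges plain-text prompts with a model. The function name `build_devils_advocate_prompt` and the template wording are hypothetical illustrations, not part of any particular agent framework and not a tested prompt.

```python
def build_devils_advocate_prompt(hypothesis: str, evidence: list[str]) -> str:
    """Build a sub-prompt asking a model to argue against a hypothesis.

    Hypothetical helper: the wording below is an illustration only.
    """
    # Render collected evidence as a bullet list, or a placeholder if empty.
    evidence_lines = "\n".join(f"- {item}" for item in evidence) or "- (none yet)"
    return (
        "You are reviewing another agent's conclusion. Argue AGAINST it.\n"
        f"Conclusion under review: {hypothesis}\n"
        "Evidence the agent collected:\n"
        f"{evidence_lines}\n"
        "1. State one concrete observation that would disprove the conclusion.\n"
        "2. Propose one tool call that would check for that observation.\n"
        "3. Explain how the evidence above could also fit a different cause.\n"
    )
```

The output of this prompt is only useful if the proposed tool call in item 2 is actually executed; otherwise the agent has merely described a counter-argument without testing it.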
Journey Context:
When an agent forms a hypothesis (e.g., 'the bug is in auth.py'), it naturally constructs verification queries that test for the hypothesis ('search auth.py for error handling') rather than against it ('search other files for the same error pattern'). The tool returns results from auth.py, the agent sees error-handling gaps, and its confidence increases, even though the same gaps exist elsewhere. This is LLM confirmation bias amplified by tool access: the agent does not just believe it is right, it actively gathers evidence that it is right and never looks for disconfirming evidence. Each iteration strengthens the bias. By step 5, the agent is highly confident and deeply wrong, with a chain of 'evidence' that makes its conclusion look well supported. A common mistake is to add more verification steps, which paradoxically makes things worse: each additional step is built on the same biased assumption, so it yields more biased evidence and higher misplaced confidence. Random exploration, the obvious alternative, is too unfocused. The better approach is structured adversarial verification: force the agent to articulate what would change its mind, then actually test for that. This is cheap (one extra tool call) and catches the most dangerous cases, where the agent is confidently wrong.
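The difference between a confirming query and an adversarial one can be sketched with a toy example. Everything here is an assumption made for illustration: the in-memory `REPO`, the `search` tool, and the `Hypothesis`, `confirm`, `try_to_falsify`, and `accept` names are hypothetical and do not come from any real agent framework. The gate accepts a hypothesis only after a falsification attempt has run and failed to undermine it.

```python
from dataclasses import dataclass, field

# Toy in-memory "repository"; file names and contents are made up.
REPO = {
    "auth.py": "def login(user):\n    token = issue(user)  # no error handling\n",
    "billing.py": "def charge(card):\n    api.charge(card)  # no error handling\n",
    "utils.py": "def safe(fn):\n    try:\n        return fn()\n    except Exception:\n        return None\n",
}


def search(pattern, files=None):
    """Hypothetical search tool: names of files whose text contains pattern."""
    names = files if files is not None else sorted(REPO)
    return [name for name in names if pattern in REPO[name]]


@dataclass
class Hypothesis:
    claim: str
    suspect: str
    confirmations: list = field(default_factory=list)
    falsification_run: bool = False
    undermined: bool = False


def confirm(h, pattern):
    """Biased check: looks only inside the suspected file."""
    hits = search(pattern, files=[h.suspect])
    h.confirmations.append(hits)
    return hits


def try_to_falsify(h, pattern):
    """Adversarial check: look for the same pattern everywhere except the suspect."""
    others = [name for name in sorted(REPO) if name != h.suspect]
    hits = search(pattern, files=others)
    h.falsification_run = True
    # If the pattern also appears elsewhere, it does not single out the suspect.
    h.undermined = bool(hits)
    return hits


def accept(h):
    """Accept only if a falsification attempt ran and did not undermine the claim."""
    return bool(h.confirmations) and h.falsification_run and not h.undermined


if __name__ == "__main__":
    h = Hypothesis(claim="the bug is in auth.py", suspect="auth.py")
    print(confirm(h, "no error handling"))         # confirming evidence only
    print(accept(h))                               # blocked: nothing tried to disprove it
    print(try_to_falsify(h, "no error handling"))  # same gap exists in another file
    print(accept(h))                               # still blocked: evidence is not specific
```

In this sketch the confirming query finds the gap in `auth.py`, but the single adversarial query finds the same gap in `billing.py`, so the evidence does not single out `auth.py` and the hypothesis is not accepted. Treating "pattern also found elsewhere" as undermining is a simplification; a real agent would need a hypothesis-specific definition of what counts as disconfirming.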
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:05:03.391818+00:00 - report_created - created