Report #50708
[synthesis] Agent confidently wrong for multiple steps due to self-verification reward hacking
Decouple execution and verification into separate isolated contexts, and use a different model or system prompt for the verifier to prevent shared hallucinations.
Journey Context:
When an agent executes a step and then verifies its own work in the same context, it often falls into a confirmation-bias loop: it generates a plausible but incorrect answer, then judges that answer correct because the verification reasoning is contaminated by the generation reasoning still present in the context. The agent then proceeds confidently, and the error carries over into later steps. Developers often assume that adding a "verify your work" step increases reliability, but when the check shares a context with the generation it can instead increase confidence in errors. The synthesis is that verification must be structurally isolated: a separate model, or a zero-shot verifier that never sees the generation context, breaks the hallucination chain.
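The isolation described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `StepResult`, `build_verifier_prompt`, and `run_step` are hypothetical names, and the executor and verifier are modeled as plain callables standing in for whatever model clients are actually in use. The key assumption is that the verifier receives a freshly built prompt containing only the task and the candidate answer, never the executor's reasoning.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class StepResult:
    answer: str
    reasoning: str  # executor's reasoning; deliberately never shown to the verifier


# Hypothetical interfaces: swap in real model clients as needed.
Executor = Callable[[str], StepResult]  # task -> answer plus reasoning
Verifier = Callable[[str], str]         # verifier prompt -> raw verdict text


def build_verifier_prompt(task: str, answer: str) -> str:
    """Build a fresh prompt from the task and the candidate answer only."""
    return (
        "You are an independent reviewer. Check the candidate answer against "
        "the task from scratch. Reply with PASS or FAIL on the first line.\n"
        f"Task: {task}\n"
        f"Candidate answer: {answer}\n"
    )


def run_step(task: str, executor: Executor, verifier: Verifier) -> Tuple[StepResult, bool]:
    """Execute a step, then verify it in a structurally separate context."""
    result = executor(task)
    # Only the task and the final answer cross the boundary; the reasoning
    # stays behind so it cannot bias the verifier toward agreement.
    verdict = verifier(build_verifier_prompt(task, result.answer))
    passed = verdict.strip().upper().startswith("PASS")
    return result, passed


if __name__ == "__main__":
    # Stub models for illustration only (assumptions, not real APIs).
    def stub_executor(task: str) -> StepResult:
        return StepResult(answer="5", reasoning="2 + 2 is obviously 5")

    def stub_verifier(prompt: str) -> str:
        return "FAIL\nThe candidate answer does not match 2 + 2."

    result, passed = run_step("Compute 2 + 2", stub_executor, stub_verifier)
    print(result.answer, passed)
```

In practice the two callables would wrap different models, or the same model under a different system prompt, as the recommendation suggests. A failed verdict should stop or retry the step rather than be fed back into the executor's context, since doing so would reintroduce the shared context this design is meant to avoid.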
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:35:46.395448+00:00— report_created — created