Report #58021
[architecture] Verifying an LLM's output with the same model produces correlated failures
Use a structurally different verification strategy: a model from a different family, a deterministic programmatic check (schema validation, unit tests, regex, AST parsing), or a rule-based linter. If the verifier must be an LLM, use a different model, or at minimum a differently prompted variant of the same model that sees different context. Prefer programmatic checks wherever the property can be checked mechanically, because they are deterministic and do not share an LLM's failure modes.
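As a rough illustration of the deterministic checks named above, the sketch below validates a worker's JSON output against a list of required fields and parses generated Python with the standard-library `ast` module. The function names and the `required` mapping are hypothetical, chosen only for this example; a real project would more likely use a full schema validator and its own test suite.

```python
import ast
import json


def check_json_fields(raw: str, required: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for name, expected in required.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems


def check_python_syntax(source: str) -> list[str]:
    """Parse generated code without running it; report a syntax error if any."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    return []
```

Both functions give the same answer for the same input every time, and neither depends on how any model reasons, which is the property the recommendation relies on.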
Journey Context:
It is tempting to add a reviewer agent that uses the same model to check a worker agent's output. But if the worker made a systematic error (a reasoning flaw common to that model architecture), the reviewer is likely to make the same error, because the two share the same failure modes. This is the independence assumption behind N-version programming in software reliability: redundant implementations add protection only to the extent that they fail independently, and two LLM instances with the same weights are not independent. Programmatic checks (schema validation, assertion tests, diff checks, AST parsing) are preferred wherever they apply because they are deterministic and catch different error classes than an LLM would. The tradeoff: programmatic checks can evaluate only structural correctness, not semantic quality, so you often need both layers.
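The two layers described above could be wired together roughly as follows. This is a minimal sketch under stated assumptions: `Verdict`, `Reviewer`, `verify`, and the `family` labels are hypothetical names invented for the example, and the reviewer is a plain callable standing in for whatever model client is actually used. The sketch takes the stricter option and refuses a reviewer from the worker's own family; relax that guard if a differently prompted variant with different context is acceptable.

```python
from dataclasses import dataclass
from typing import Callable

# A check takes the worker's output and returns a list of problems (empty = pass).
Check = Callable[[str], list[str]]


@dataclass(frozen=True)
class Verdict:
    accepted: bool
    stage: str  # the layer that produced the decision
    reasons: tuple[str, ...] = ()


@dataclass(frozen=True)
class Reviewer:
    family: str   # hypothetical label for the reviewer's model family
    judge: Check  # stand-in for a call to the reviewing model


def verify(output: str, worker_family: str,
           structural_checks: list[Check], reviewer: Reviewer) -> Verdict:
    # Guard against correlated failures: the reviewer must differ from the worker.
    if reviewer.family == worker_family:
        raise ValueError("reviewer shares the worker's model family")
    # Layer 1: deterministic checks run first and short-circuit on failure.
    for check in structural_checks:
        problems = check(output)
        if problems:
            return Verdict(False, "structural", tuple(problems))
    # Layer 2: semantic review, reached only by structurally valid output.
    objections = reviewer.judge(output)
    if objections:
        return Verdict(False, "semantic", tuple(objections))
    return Verdict(True, "semantic")
```

Running the deterministic layer first keeps cheap, reproducible failures from ever reaching the LLM reviewer, so the reviewer's judgment is spent only on the semantic questions that code cannot answer.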
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:52:47.432353+00:00— report_created — created