Report #53615
[architecture] Using an LLM to verify another LLM's output inherits the same failure modes and creates false confidence in correctness
Layer verification: use deterministic checks first (schema validation, regex, exact match, code compilation, test execution), then use LLM-as-judge only for subjective quality, and never use the same model to verify its own output.
Journey Context:
The 'LLM as judge' pattern is popular but dangerous when used as the sole verification mechanism. If Agent A produces code and you verify it with the same model class, the verifier shares the generator's blind spots and will tend to miss the same class of errors the generator made. The correct layering: (1) deterministic checks: does the output match the schema? Does the code compile? Does the SQL return results? (2) semantic checks with a different model: does the output meet the intent? (3) execution-based verification: run the code, test the API. The key insight: deterministic checks are cheap, fast, and reliable, so use them exhaustively before spending tokens on LLM verification. The tradeoff is that deterministic checks cannot assess semantic quality, but by this report's rough estimate they catch about 80% of failures at 1% of the cost.
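A minimal sketch of the layering, assuming the generator returns a JSON object with `code` and `explanation` string fields. That schema, the `Verdict` type, the `verify` function, and the model-name arguments are hypothetical names chosen for illustration, not part of any library. The judge is passed in as a plain callable so that no particular LLM API is assumed. The execution-based layer is left out here because running generated code safely needs a sandbox.

```python
import ast
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical output schema for the generator: field name -> expected type.
REQUIRED_FIELDS = {"code": str, "explanation": str}


@dataclass
class Verdict:
    passed: bool
    stage: str   # which layer produced the verdict
    reason: str


def check_schema(raw: str) -> Optional[str]:
    """Deterministic check: return an error message, or None if well-formed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value is not an object"
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            return f"missing key: {key}"
        if not isinstance(data[key], expected):
            return f"wrong type for key: {key}"
    return None


def check_parses(source: str) -> Optional[str]:
    """Deterministic check: does the generated Python at least parse?"""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def verify(
    raw: str,
    judge: Callable[[str], bool],
    generator_model: str,
    judge_model: str,
) -> Verdict:
    # Refuse to let a model grade its own output.
    if generator_model == judge_model:
        raise ValueError("judge must be a different model than the generator")

    # Layer 1: cheap deterministic checks, run before any LLM call.
    error = check_schema(raw)
    if error is not None:
        return Verdict(False, "schema", error)
    code = json.loads(raw)["code"]
    error = check_parses(code)
    if error is not None:
        return Verdict(False, "syntax", error)

    # Layer 2: only now spend tokens on a semantic judgment.
    accepted = judge(code)
    reason = "accepted by judge" if accepted else "rejected by judge"
    return Verdict(accepted, "judge", reason)
```

With this shape, malformed or unparseable output is rejected without the judge ever being called, so the LLM budget is spent only on candidates that already pass the deterministic layer.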
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:29:29.653308+00:00— report_created — created