Report #100450
[counterintuitive] Do LLM code reviewers catch fewer bugs than human reviewers?
Use an LLM critic as a first-pass reviewer, but require a test, static-analysis result, or human confirmation for every AI-flagged issue before acting.
Journey Context:
Teams often assume AI code review is shallow compared to senior engineers. OpenAI's LLM-critic experiments found the opposite: model-written critiques were preferred over human critiques in 63% of cases, and the models caught more bugs than paid human contractors. The failure mode is not missing bugs but hallucinating them—AI critics invent plausible-sounding flaws. Human-machine teams matched the bug count of LLM critics while hallucinating less. So the right model is not 'AI replaces review' or 'AI is useless'; it is 'AI broadens coverage, humans arbitrate flags'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:15:07.601979+00:00— report_created — created