Report #71847
[counterintuitive] AI code review findings are consistent and unbiased across runs
When using AI for code review, run the same review multiple times with slightly different phrasings or code orderings and aggregate the findings. Treat single-run AI review as unreliable — it is subject to positional bias, anchoring on the first issue found, and inconsistent severity ratings. Deduplicate findings across runs.
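One way to produce the "slightly different phrasings or code orderings" is sketched below. It is a minimal illustration, not a prescribed method: the function name `make_review_variants`, the phrasings in `DEFAULT_PHRASINGS`, and the `# File:` header format are all assumptions made for this example. It rotates the file order instead of shuffling it, so the variants are reproducible and guaranteed to differ whenever more than one file is supplied. Sending the prompts to a model is left out on purpose, since that depends on the tool in use.

```python
from typing import Dict, List, Sequence

# Hypothetical phrasings; any set of roughly equivalent instructions would do.
DEFAULT_PHRASINGS = (
    "Review the following code and list any bugs.",
    "Act as a careful reviewer. What defects does this code contain?",
    "Identify correctness, security, and maintainability problems below.",
)


def make_review_variants(
    files: Dict[str, str],
    runs: int = 3,
    phrasings: Sequence[str] = DEFAULT_PHRASINGS,
) -> List[str]:
    """Build one prompt per run, rotating file order and cycling phrasings."""
    names = sorted(files)  # fixed base order keeps the variants reproducible
    prompts = []
    for i in range(runs):
        # Rotate the file list so a different file comes first in each run.
        shift = i % len(names) if names else 0
        order = names[shift:] + names[:shift]
        body = "\n\n".join(f"# File: {name}\n{files[name]}" for name in order)
        prompts.append(f"{phrasings[i % len(phrasings)]}\n\n{body}")
    return prompts
```

Each returned prompt would then be submitted as a separate, independent review run, so that no run sees the output of another.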
Journey Context:
Developers often assume AI review is consistent because it is software: given the same input, it should produce the same output. In practice, the same code reviewed twice may get different feedback, different severity ratings, or have different bugs missed. The review is sensitive to prompt framing and code ordering, and generation samples tokens stochastically. More subtly, AI exhibits anchoring bias: if it identifies one issue early in its analysis, it tends to focus on similar issues and miss unrelated ones. It also exhibits leniency bias toward code that follows common patterns, even if the pattern is wrong in context. These biases differ from human biases (humans anchor on their area of expertise and are lenient toward their own code), but they are biases nonetheless. The practical fix: treat AI review as a stochastic sampling process rather than a deterministic analysis. Run the review 3-5 times and choose the aggregation to match the goal: the union of findings improves recall (fewer missed bugs) at the cost of more false positives, while the intersection improves precision at the cost of recall.
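The union and intersection strategies can be treated as two ends of a single vote threshold. The sketch below is a minimal illustration under stated assumptions: the `Finding` fields, the 1 to 5 severity scale, and the `aggregate` function are hypothetical, and matching findings on exact file, line, and normalized category is a simplification (real reviews may report the same bug on nearby lines or under different wording). Setting `min_votes=1` gives the union, `min_votes=len(runs)` gives the intersection, and values in between give a majority-style compromise. Inconsistent severity ratings are reconciled by taking the median.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median_high
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    category: str
    severity: int  # hypothetical scale: 1 (minor) to 5 (critical)


Key = Tuple[str, int, str]


def _key(f: Finding) -> Key:
    # Normalize the category so trivially different wording deduplicates.
    return (f.file, f.line, f.category.strip().lower())


def aggregate(runs: List[List[Finding]], min_votes: int = 1) -> Dict[Key, int]:
    """Return {finding key: median severity} for findings in >= min_votes runs."""
    severities: Dict[Key, List[int]] = defaultdict(list)
    for run in runs:
        seen: Dict[Key, int] = {}
        for f in run:
            seen.setdefault(_key(f), f.severity)  # count each finding once per run
        for key, severity in seen.items():
            severities[key].append(severity)
    return {
        key: median_high(votes)
        for key, votes in severities.items()
        if len(votes) >= min_votes
    }
```

For example, with three runs, `aggregate(runs, 1)` would be used for a broad triage list and `aggregate(runs, 3)` for a short list of findings every run agreed on.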
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:10:46.416571+00:00— report_created — created