Report #87674
[counterintuitive] Do coding benchmark scores (HumanEval, MBPP) predict real-world AI coding performance?
Discount benchmark scores significantly when estimating real-world capability. Evaluate the AI on your own codebase, domain, and task distribution. Prefer SWE-bench (real GitHub issues) to HumanEval (self-contained functions) as the more realistic benchmark. Before trusting a claimed capability level, validate it with your own evaluation suite on your own code.
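The recommendation to validate with your own evaluation suite can be sketched as a small harness. This is a minimal illustration, not a definitive implementation: `EvalTask`, `run_suite`, `pass_rate`, and both `candidate_*` functions are hypothetical names, and the candidates are hand-written stand-ins for model output. In practice the candidates would come from the AI tool under test and the cases from your own code and its edge cases.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence, Tuple


@dataclass
class EvalTask:
    """One task drawn from your own code (hypothetical structure)."""
    name: str
    candidate: Callable[..., Any]            # model-generated implementation
    cases: Sequence[Tuple[tuple, Any]]       # (args, expected) pairs, incl. edge cases


def task_passes(task: EvalTask) -> bool:
    """A task passes only if every case passes; exceptions count as failures."""
    for args, expected in task.cases:
        try:
            if task.candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


def run_suite(tasks: Sequence[EvalTask]) -> Dict[str, bool]:
    """Map each task name to whether its candidate passed all cases."""
    return {task.name: task_passes(task) for task in tasks}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


# Stand-ins for model output (assumption: real candidates come from the AI tool).
def candidate_slugify(text: str) -> str:
    return text.lower().replace(" ", "-")


def candidate_clamp(value: int, low: int, high: int) -> int:
    return max(low, min(value, high))


tasks = [
    EvalTask("slugify", candidate_slugify, [
        (("Hello World",), "hello-world"),
        (("  padded  ",), "padded"),         # edge case: surrounding whitespace
    ]),
    EvalTask("clamp", candidate_clamp, [
        ((5, 0, 10), 5),
        ((-3, 0, 10), 0),
        ((42, 0, 10), 10),
    ]),
]

results = run_suite(tasks)
print(results, f"pass rate: {pass_rate(results):.0%}")
```

Here the `slugify` stand-in handles the simple case but fails the whitespace edge case, so the suite reports one pass out of two tasks. A test set containing only the simple case would have reported a perfect score.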
Journey Context:
Coding benchmarks show impressive and rapidly improving scores, creating an illusion of general capability. The distribution gap between benchmarks and reality is severe and systematic: (1) HumanEval tests self-contained functions with clear specs, whereas real tasks require understanding multi-file architectures and implicit requirements. (2) Benchmark test cases are simple and miss the edge cases that matter in production. (3) Benchmarks test "write from scratch," but real work is mostly "modify existing code," which depends on context the model often lacks. (4) Training data contamination inflates scores as benchmarks leak into training sets. SWE-bench, which uses real GitHub issues, shows dramatically lower pass rates than HumanEval, revealing the gap. A model scoring 90% on HumanEval might resolve only 2-4% of real GitHub issues without assistance. The benchmark-to-reality gap is not a fixed multiplier: it is task-dependent and largest for the complex, context-heavy work that senior engineers actually do.
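The point that the gap is task-dependent rather than a fixed multiplier can be checked by breaking your own results down by task category. The sketch below assumes you have recorded a category and a pass/fail outcome per task. The category names, the records, and `CLAIMED_BENCHMARK_SCORE` are all made up for illustration and are not measurements.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical in-house evaluation records: (task category, passed?).
# These values are invented for illustration only.
records = [
    ("self_contained_function", True),
    ("self_contained_function", True),
    ("self_contained_function", True),
    ("self_contained_function", False),
    ("modify_existing_code", True),
    ("modify_existing_code", False),
    ("modify_existing_code", False),
    ("modify_existing_code", False),
    ("multi_file_change", False),
    ("multi_file_change", False),
    ("multi_file_change", False),
    ("multi_file_change", False),
]

# Assumed vendor-reported benchmark score, used only as a reference point.
CLAIMED_BENCHMARK_SCORE = 0.90


def pass_rate_by_category(rows: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate separately for each task category."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in rows:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


rates = pass_rate_by_category(records)
for category, rate in rates.items():
    # A fixed multiplier would give the same ratio in every category.
    ratio = rate / CLAIMED_BENCHMARK_SCORE
    print(f"{category:26s} local={rate:.0%}  ratio_to_claimed={ratio:.2f}")
```

If the ratio to the claimed score differs widely between categories, as it does in this invented data, a single discount factor applied to the benchmark score will mislead. Per-category rates from your own tasks are the more useful estimate.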
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:44:57.791945+00:00— report_created — created