Report #52179
[counterintuitive] AI coding benchmarks predict real-world engineering capability
Interpret benchmark scores as a lower bound on capability in controlled settings, not as a predictor of real-world effectiveness. When evaluating AI tools, test on your own codebase with your own issues, not on public benchmarks. The distribution shift between benchmarks and production is severe and underappreciated.
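One way to act on this advice is to record pass/fail outcomes on a sample of your own resolved issues and report the resolve rate with an uncertainty range, since local samples are usually small. The following is a minimal sketch, not a prescribed method: `TrialResult`, `wilson_interval`, and `local_resolve_rate` are hypothetical names, and it assumes "resolved" means your own test suite passed after the AI-generated change was applied.

```python
import math
from dataclasses import dataclass


@dataclass
class TrialResult:
    issue_id: str   # hypothetical identifier from your own issue tracker
    resolved: bool  # assumption: True only if your own tests passed after the change


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def local_resolve_rate(results: list[TrialResult]) -> tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for the local resolve rate."""
    successes = sum(r.resolved for r in results)
    low, high = wilson_interval(successes, len(results))
    return successes / len(results), low, high
```

With only ten local issues the interval is wide, which is itself a useful signal: a handful of trials on your own codebase cannot confirm or refute a public benchmark score with much precision, so the sample should grow before drawing conclusions.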
Journey Context:
Coding benchmarks (HumanEval, MBPP, SWE-bench) measure performance on self-contained tasks with clear specifications and known solutions. Real engineering involves ambiguous requirements, organizational constraints, cross-system invariants, legacy compatibility, and tradeoff decisions under uncertainty. On SWE-bench, even top models solve only ~40-50% of real GitHub issues (as of late 2024), and these are issues that were actually resolved, meaning a solution exists and the problem is well-scoped. For novel problems with unclear specs, performance drops dramatically. Three factors create the illusion of capability: (1) benchmark contamination: models train on GitHub data that overlaps with benchmark problems, which inflates scores; (2) demo-driven development: cherry-picked successes are visible while failures are invisible; (3) quiet failures: tasks where AI fails tend to end in 'I couldn't do it' rather than 'I did it wrong,' so the attempt is abandoned without being recorded and failures are undercounted. Senior engineers provide value precisely in the long tail of novel, ambiguous, cross-cutting problems where AI is weakest.
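The undercounting described in factor (3) can be made concrete with a small tally. This is an illustrative sketch under stated assumptions: the `Outcome` categories and the `success_rates` helper are hypothetical, and the numbers in the usage note are made up for illustration, not measured results. It shows how the apparent success rate changes depending on whether abandoned attempts stay in the denominator.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"  # change was produced and accepted
    WRONG = "wrong"      # change was produced but failed tests or review
    GAVE_UP = "gave_up"  # no usable change produced ("I couldn't do it")


def success_rates(outcomes: list[Outcome]) -> dict[str, float]:
    """Compare success over all attempts with success over submitted changes only."""
    counts = Counter(outcomes)
    total = len(outcomes)
    submitted = counts[Outcome.CORRECT] + counts[Outcome.WRONG]
    return {
        # Honest rate: every attempt counts, including abandoned ones.
        "over_all_attempts": counts[Outcome.CORRECT] / total if total else 0.0,
        # Inflated rate: abandoned attempts silently drop out of the denominator.
        "over_submitted_only": counts[Outcome.CORRECT] / submitted if submitted else 0.0,
    }
```

For example, with 4 correct, 1 wrong, and 5 abandoned attempts, `success_rates` reports 0.4 over all attempts but 0.8 over submitted changes only. Logging every attempt, including the ones that ended without output, keeps the first figure available.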
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:04:32.952439+00:00— report_created — created