Report #94961
[counterintuitive] Do AI coding benchmark scores predict real-world performance?
Evaluate AI models on your actual codebase, not on benchmark leaderboards. Build a small evaluation set from your own recent bug fixes and PRs. A model that scores 90% on HumanEval might score 40% on your codebase if that codebase uses unconventional patterns, custom frameworks, or domain-specific abstractions.
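A minimal sketch of such an evaluation set, assuming each past bug fix or PR can be reduced to a prompt plus a pass/fail check. The names `EvalCase`, `pass_rate`, and `stub_model`, and the two example cases, are hypothetical placeholders; in practice `check` would more likely run your test suite against the generated patch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    """One task drawn from a past bug fix or PR (all fields are hypothetical)."""
    case_id: str
    prompt: str                    # task description handed to the model
    check: Callable[[str], bool]   # True if the model's output is acceptable


def pass_rate(generate: Callable[[str], str], cases: Iterable[EvalCase]) -> float:
    """Return the fraction of cases whose generated output passes its check."""
    results = []
    for case in cases:
        try:
            results.append(bool(case.check(generate(case.prompt))))
        except Exception:
            # A crash in generation or checking counts as a failed case.
            results.append(False)
    return sum(results) / len(results) if results else 0.0


# Illustrative cases only; real checks would exercise your own code and tests.
cases = [
    EvalCase("fix-101", "Name the retry helper", lambda out: "retry_with_backoff" in out),
    EvalCase("fix-102", "Name the config loader", lambda out: "load_tenant_config" in out),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: prompt in, text out)."""
    return "retry_with_backoff"


print(pass_rate(stub_model, cases))  # 0.5
```

Swapping `stub_model` for a call to each candidate model gives a per-model pass rate on tasks that reflect your own conventions.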
Journey Context:
Coding benchmarks (HumanEval, MBPP, even SWE-bench) sample from a distribution of problems that doesn't match most production codebases. Production code has custom frameworks, implicit conventions, domain-specific abstractions, legacy patterns, and organizational knowledge encoded in comments and commit messages. AI models are trained heavily on open-source code that follows conventional patterns, which creates a distribution shift: a model performs well on benchmark-like problems but can degrade significantly on codebases that diverge from its training distribution. DS-1000 showed a similar gap in data science, where benchmark performance didn't transfer to realistic data science code. The practical implication: benchmark scores are useful for comparing models on standard tasks but a poor predictor of performance on your specific codebase. Ignoring benchmarks entirely is also wrong, since they indicate baseline capability, but selecting models solely on benchmark scores is a systematic error.
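One way to act on this, sketched under assumptions: treat the public benchmark score as a baseline filter and rank the remaining models by their pass rate on your own evaluation set. The model names, the scores, and the 0.5 threshold below are invented for illustration and are not measurements.

```python
from typing import List, NamedTuple


class ModelScore(NamedTuple):
    name: str
    benchmark: float  # public benchmark pass rate, 0..1
    in_house: float   # pass rate on your own evaluation set, 0..1


def rank_models(scores: List[ModelScore], min_benchmark: float = 0.5) -> List[ModelScore]:
    """Drop models below a baseline benchmark score, then rank by in-house pass rate."""
    eligible = [s for s in scores if s.benchmark >= min_benchmark]
    return sorted(eligible, key=lambda s: s.in_house, reverse=True)


# Made-up numbers: the benchmark leader is not the in-house leader.
candidates = [
    ModelScore("model-a", benchmark=0.90, in_house=0.40),
    ModelScore("model-b", benchmark=0.75, in_house=0.62),
    ModelScore("model-c", benchmark=0.30, in_house=0.55),
]

ranked = rank_models(candidates)
print([s.name for s in ranked])  # ['model-b', 'model-a']
```

The benchmark score only gates eligibility here; the ordering comes entirely from results on your own code.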
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:58:24.760452+00:00— report_created — created