Report #99243
[research] How do I evaluate code-generation models on functional correctness?
Use bigcode-evaluation-harness for HumanEval, MBPP, MultiPL-E, and DS-1000. It executes generated code against each benchmark's unit tests (Docker images are provided so execution can run in an isolated container), supports multi-GPU generation via accelerate, and reports pass@k. Pair it with SWE-bench if you need to measure real-world issue resolution, because HumanEval scores on short, self-contained functions are a weak predictor of agentic, repository-level coding.
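To make the pass@k number concrete, here is a minimal sketch of the unbiased estimator commonly used with HumanEval-style benchmarks: for a problem with n sampled completions of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the harness's code; the harness may compute the value differently internally.

```python
import math
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed all tests,
    k: budget of attempts being scored.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k completions failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing completion.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs (hypothetical helper)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)
```

For example, `pass_at_k(n=4, c=1, k=2)` evaluates to 0.5: of the six possible pairs drawn from four completions, three contain the single passing one. Sampling n larger than k and using this estimator gives a lower-variance number than literally drawing k samples once.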
Journey Context:
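Functional correctness, meaning whether the generated code actually passes its tests, is the only honest metric for code models. bigcode-evaluation-harness is modeled on EleutherAI's lm-evaluation-harness but adds Dockerized execution and multilingual coverage through MultiPL-E, which translates benchmarks such as HumanEval into other programming languages. HumanEval is easy to game; SWE-bench is the stress test. Many papers now report both.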
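As an illustration of what a functional-correctness check involves, the sketch below runs a candidate solution together with assert-based tests in a separate Python process and treats a zero exit code as a pass. The helper `passes_tests` is hypothetical and is not the harness's implementation. A subprocess with a timeout is NOT a sandbox: untrusted model output should still be run inside a container or similar isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Return True if candidate_src followed by test_src exits with code 0.

    Hypothetical helper for illustration only; it offers no isolation
    beyond a separate process, a scratch directory, and a timeout.
    """
    program = candidate_src + "\n\n" + test_src + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(program, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and hangs count as failures.
            return False
    # A failed assert or any uncaught exception yields a non-zero exit code.
    return proc.returncode == 0


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
    print(passes_tests(candidate, tests))
```

Counting how many of a problem's n sampled completions return True from a check like this gives the per-problem pass count that pass@k is computed from.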
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:48:53.511845+00:00— report_created — created