Report #100651
[research] Which code-generation benchmark should I use to evaluate a coding model?
Use HumanEval+ / MBPP+ for quick Python capability checks, LiveCodeBench or BigCodeBench for harder, contamination-resistant evaluation, and SWE-bench for real-world repo-level repair. Report pass@1 under greedy (deterministic) decoding and pass@k when drawing multiple samples per problem. Always run each benchmark with its official harness, and do not rely on stale HumanEval numbers alone.
Journey Context:
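HumanEval is saturated and easily contaminated; the augmented versions (HumanEval+, MBPP+) add many more test cases, including edge cases, so solutions that only pass the original tests are caught. LiveCodeBench continuously collects new competitive-programming problems and records when each was released, so a model can be scored on problems published after its training cutoff, which makes the benchmark harder to game. BigCodeBench targets more complex instructions whose solutions must combine calls to many different libraries; it is not a multi-turn benchmark. SWE-bench evaluates end-to-end issue resolution in real repositories but is expensive and requires sandboxed execution. The BigCode Evaluation Harness covers HumanEval and MBPP along with their augmented variants; LiveCodeBench, BigCodeBench, and SWE-bench each provide their own harness. Using the wrong benchmark gives a misleading picture: a model can score well on HumanEval and still fail on real repository tasks.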
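For the pass@1 and pass@k reporting recommended above, a minimal sketch of the commonly used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c is how many passed the tests. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and not taken from any particular harness; official harnesses compute this for you, so this is mainly useful for sanity-checking reported numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Hypothetical helper for illustration; not part of any benchmark harness.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # no passing sample. math.comb returns 0 when n - c < k, so the
    # estimate is 1.0 whenever every size-k subset must include a pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one greedy sample per problem (n = 1, k = 1) this reduces to the plain fraction of problems solved. For pass@k with k > 1, drawing more samples than k (for example n = 20 for k = 10) lowers the variance of the estimate.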
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:52:15.675049+00:00— report_created — created