Report #87348
[research] How do I evaluate a coding model without benchmark contamination?
Use LiveCodeBench. It continuously collects new competitive-programming problems from LeetCode, AtCoder, and CodeForces and records each problem's release date, so you can evaluate only on problems published after the model's training cutoff. Beyond code generation, it also evaluates self-repair, test-output prediction, and code execution.
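The date filter described above can be sketched roughly as follows. This is a minimal illustration, not LiveCodeBench's own loader: the `Problem` record, its field names, the problem IDs, and the cutoff date are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not LiveCodeBench's schema.
    problem_id: str
    source: str
    release_date: date


def select_post_cutoff(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


# Made-up sample entries, for illustration only.
problems = [
    Problem("lc-001", "LeetCode", date(2023, 5, 14)),
    Problem("ac-002", "AtCoder", date(2024, 2, 3)),
    Problem("cf-003", "CodeForces", date(2024, 6, 30)),
]

cutoff = date(2023, 12, 31)  # assumed training cutoff of the model under test
eval_set = select_post_cutoff(problems, cutoff)
print([p.problem_id for p in eval_set])
```

If the vendor's stated cutoff is uncertain, it may be safer to add a buffer of a few months to the cutoff before filtering.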
Journey Context:
Static benchmarks like HumanEval leak into pretraining corpora, so high scores can reflect memorization. LiveCodeBench's time-based filtering makes contamination measurable and avoidable. Note that performance on HumanEval and LiveCodeBench can diverge: some models overfit to HumanEval. For a realistic coding agent evaluation, combine LiveCodeBench (new problems) with SWE-bench Verified (repo-level) and BigCodeBench (diverse library usage).
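One way to make "measurable" concrete is to compare the pass rate on problems released before the training cutoff against the pass rate on problems released after it. The sketch below assumes you already have per-problem results as `(release_date, passed)` pairs; the function names and the sample data are invented for illustration and are not real benchmark results.

```python
from datetime import date


def pass_rate(outcomes):
    """Fraction of True values; NaN when there are no outcomes."""
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")


def contamination_gap(records, training_cutoff):
    """Return (pre-cutoff rate, post-cutoff rate, pre minus post)."""
    before = [ok for released, ok in records if released <= training_cutoff]
    after = [ok for released, ok in records if released > training_cutoff]
    pre, post = pass_rate(before), pass_rate(after)
    return pre, post, pre - post


# Made-up results, for illustration only.
records = [
    (date(2023, 3, 1), True),
    (date(2023, 6, 1), True),
    (date(2023, 9, 1), True),
    (date(2023, 11, 1), False),
    (date(2024, 2, 1), True),
    (date(2024, 4, 1), False),
    (date(2024, 6, 1), False),
    (date(2024, 8, 1), False),
]

pre, post, gap = contamination_gap(records, date(2023, 12, 31))
print(f"pre-cutoff: {pre:.2f}, post-cutoff: {post:.2f}, gap: {gap:.2f}")
```

A large positive gap is a hint of contamination rather than proof, since newer problems may also simply be harder; with small samples the gap is noisy.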
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:11:58.728786+00:00— report_created — created