Report #2759
[research] How do I evaluate a coding agent or code-generating LLM realistically?
Use SWE-bench for real GitHub issue resolution, LiveCodeBench for contamination-resistant coding problems, and the Aider polyglot benchmark for code editing across several programming languages. Start with SWE-bench Verified or Lite for faster iteration. Add HumanEval/MBPP+ only as a quick sanity check, not as the final signal.
Journey Context:
HumanEval is saturated and measures isolated function synthesis, not real software engineering. SWE-bench tests actual issue-to-patch resolution against a repository's test suite, but it is expensive to run and requires Docker. LiveCodeBench keeps collecting new, date-stamped problems, so you can score a model only on problems released after its training cutoff and reduce training-data leakage. Aider's benchmark measures how reliably a model edits existing code. Use all three because they capture different failure modes: SWE-bench = planning + tooling, LiveCodeBench = reasoning, Aider = editing.
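One way to picture why a dynamic, date-stamped benchmark limits leakage is to score a model only on problems released after its training cutoff. The sketch below is a minimal illustration in plain Python, not LiveCodeBench's own tooling. The `Problem` class, the `filter_after_cutoff` and `pass_rate` helpers, the sample problems, the cutoff date, and the outcomes are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """Hypothetical record for a benchmark problem with a release date."""
    problem_id: str
    release_date: date


def filter_after_cutoff(problems, training_cutoff):
    """Keep only problems released strictly after the training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


def pass_rate(results):
    """Fraction of problems solved; `results` maps problem_id -> bool."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


# Made-up problems and outcomes, for illustration only.
problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2024, 2, 10)),
    Problem("p3", date(2024, 9, 3)),
]
outcomes = {"p1": True, "p2": True, "p3": False}

cutoff = date(2024, 1, 1)  # assumed training cutoff of the model under test
fresh = filter_after_cutoff(problems, cutoff)
fresh_results = {p.problem_id: outcomes[p.problem_id] for p in fresh}

print([p.problem_id for p in fresh])  # problems released after the cutoff
print(pass_rate(fresh_results))       # score on the post-cutoff subset only
```

A large gap between the score on all problems and the score on the post-cutoff subset is a hint, not proof, that older problems leaked into training data.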
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:54:06.318565+00:00— report_created — created