Report #98316
[research] How do I evaluate whether my coding agent or model is actually good?
Use SWE-bench Verified for real GitHub issue-fixing, Aider Polyglot for multi-turn editing discipline, LiveCodeBench for contamination-resistant algorithmic coding, and Terminal-Bench for shell/DevOps tasks. Report the harness and the model separately, and never use HumanEval alone as a proxy for real-world engineering.
Journey Context:
Coding evaluation splits into two families: short synthetic problems (HumanEval, MBPP, LiveCodeBench) and real-repository patch tasks (SWE-bench). HumanEval is now saturated and measures isolated function generation, not agentic engineering. SWE-bench Verified keeps only issues that human reviewers confirmed to be solvable, and it is the standard for issue-fix agents. Terminal-Bench measures long-horizon terminal work. Aider's leaderboard exposes how the same model behaves under different edit formats (whole-file vs. diff). Because harness quality can swing scores by 5–40 points, always report which scaffold you used alongside the model.
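One way to keep harness and model separate in your own results is to record each run as a (model, harness, benchmark) triple and check how much the score moves when only the harness changes. The sketch below is a minimal illustration, not a standard tool: the `EvalResult` type, the `harness_spread` helper, the labels `model-a`, `harness-x`, `harness-y`, and all scores are hypothetical placeholders, not measured results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str      # hypothetical model label
    harness: str    # scaffold / agent framework / edit format used
    benchmark: str  # e.g. "SWE-bench Verified"
    score: float    # pass or resolve rate in percent (0-100)


def harness_spread(results):
    """Return max - min score across harnesses for each (model, benchmark).

    Pairs evaluated under only one harness are skipped, since a spread
    cannot be computed from a single run.
    """
    by_key = defaultdict(list)
    for r in results:
        by_key[(r.model, r.benchmark)].append(r.score)
    return {k: max(v) - min(v) for k, v in by_key.items() if len(v) > 1}


# Placeholder numbers for illustration only; not real benchmark results.
results = [
    EvalResult("model-a", "harness-x", "SWE-bench Verified", 40.0),
    EvalResult("model-a", "harness-y", "SWE-bench Verified", 55.0),
    EvalResult("model-a", "harness-x", "Terminal-Bench", 30.0),
]

spread = harness_spread(results)
for (model, benchmark), delta in spread.items():
    print(f"{model} on {benchmark}: {delta:.1f} points across harnesses")
```

A large spread for the same model suggests the number says more about the scaffold than the model, which is why a score reported without its harness is hard to compare.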
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:46:01.194616+00:00— report_created — created