Report #3239
[research] How do I evaluate whether my coding agent actually improved after a change?
Use SWE-bench Verified for real-world bug-fixing, Aider's polyglot benchmark for edit-and-retry coding exercises across several languages, and EvalPlus for code-generation correctness beyond plain HumanEval. Report pass@1 at temperature=0, not pass@k, and run the before and after versions on the same task set with the same provider and hardware.
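A before/after comparison on a fixed task set can be sketched as below. This is a minimal illustration, not part of any benchmark's tooling: the function names, the task IDs, and the result format (a mapping from task ID to a pass/fail boolean, one attempt per task) are all assumptions made for the example.

```python
from typing import Dict, List, Union

Summary = Dict[str, Union[float, List[str]]]


def pass_at_1(results: Dict[str, bool]) -> float:
    """Fraction of tasks solved on the single attempt each task received."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


def compare_runs(before: Dict[str, bool], after: Dict[str, bool]) -> Summary:
    """Compare two runs task by task; both must cover the same task IDs."""
    if before.keys() != after.keys():
        raise ValueError("runs must cover the same task IDs")
    # Tasks that flipped from fail to pass, and from pass to fail.
    fixed = sorted(t for t in before if not before[t] and after[t])
    regressed = sorted(t for t in before if before[t] and not after[t])
    return {
        "pass@1_before": pass_at_1(before),
        "pass@1_after": pass_at_1(after),
        "fixed": fixed,
        "regressed": regressed,
    }


if __name__ == "__main__":
    # Hypothetical task IDs and outcomes, for illustration only.
    before = {"task-a": True, "task-b": False, "task-c": False, "task-d": True}
    after = {"task-a": True, "task-b": True, "task-c": False, "task-d": False}
    print(compare_runs(before, after))
```

Listing the fixed and regressed tasks alongside the aggregate matters because an unchanged pass@1 can hide one fix cancelled out by one regression, as in the hypothetical data above.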
Journey Context:
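Teams often run HumanEval and declare victory, but it is saturated and only tests single-function Python. SWE-bench Verified is a human-validated 500-task subset of SWE-bench, filtered to issues that are clearly specified and have reliable tests. Aider's polyglot benchmark asks the model to edit existing code in several languages and gives it a second attempt after it sees the failing tests, so it catches edit-format and error-recovery failures that a single-shot benchmark can miss. EvalPlus adds far more tests to HumanEval, exposing solutions that pass the original tests but are not actually correct. A frequently misleading metric is pass@k with a large k such as pass@100; pass@1 is what matters for an agent that runs once per task. Evaluate with greedy decoding (temperature=0) where possible; hosted APIs are not always deterministic even at temperature 0, so repeat a run to gauge noise before attributing a small difference to your change.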
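To see why pass@100 can mislead, it helps to compute both numbers from the same samples. The sketch below uses the unbiased pass@k estimator commonly attributed to the original HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a task and c is how many of them passed. The sample counts in the example are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one task.

    n: total samples generated for the task
    c: number of those samples that passed the tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # which means any draw of k samples contains a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 100 samples drawn, only 5 of them pass.
    n, c = 100, 5
    print(f"pass@1   = {pass_at_k(n, c, 1):.3f}")
    print(f"pass@10  = {pass_at_k(n, c, 10):.3f}")
    print(f"pass@100 = {pass_at_k(n, c, 100):.3f}")
```

With these made-up counts, a task the model solves about 5% of the time still scores a perfect pass@100, which says little about an agent that gets a single attempt.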
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:55:20.107302+00:00— report_created — created