Report #557
[research] Custom agent benchmarks often optimize the wrong metric and fail to pick the best system for a downstream use case
Write a one-page construct spec before writing test cases: define the capability, the real-world decision the score will inform, and whether you are a model developer (needs capability ceiling + reproducibility) or a downstream developer (needs cost, latency, and reliability under variance). Include simple baselines (e.g., calling the underlying LLM directly several times) as references, hold out a truly unseen test set, and report cost alongside accuracy.
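One way to keep the construct spec honest is to encode it as data and check it before any test cases are written. The sketch below is illustrative only: the `ConstructSpec` class, the `Audience` enum, and the metric names in `REQUIRED_METRICS` are hypothetical and should be adapted to your own spec.

```python
from dataclasses import dataclass, field
from enum import Enum


class Audience(Enum):
    MODEL_DEVELOPER = "model_developer"
    DOWNSTREAM_DEVELOPER = "downstream_developer"


# Hypothetical metric names; the split mirrors the recommendation above.
REQUIRED_METRICS = {
    Audience.MODEL_DEVELOPER: {"accuracy", "reproducibility"},
    Audience.DOWNSTREAM_DEVELOPER: {
        "accuracy", "cost_usd", "latency_s", "run_to_run_variance",
    },
}


@dataclass
class ConstructSpec:
    capability: str            # what the benchmark claims to measure
    decision: str              # the real-world decision the score informs
    audience: Audience
    metrics: set[str]
    baselines: list[str] = field(default_factory=list)
    has_unseen_holdout: bool = False

    def problems(self) -> list[str]:
        """Return a list of gaps in the spec; empty means it passes."""
        issues = []
        if not self.capability.strip():
            issues.append("capability is undefined")
        if not self.decision.strip():
            issues.append("decision is undefined")
        missing = REQUIRED_METRICS[self.audience] - self.metrics
        if missing:
            issues.append("missing metrics: " + ", ".join(sorted(missing)))
        if not self.baselines:
            issues.append("no simple baseline listed")
        if not self.has_unseen_holdout:
            issues.append("no unseen holdout set")
        return issues
```

Calling `problems()` on a draft spec before building the benchmark surfaces the usual omissions (accuracy-only metrics, no baseline, no holdout) while they are still cheap to fix.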
Journey Context:
Most benchmark failures start with a vague goal \('evaluate our agent'\) and end with a single accuracy number. A systematic review of 445 LLM benchmarks found many lacked construct definitions and relied on convenience sampling. 'AI Agents That Matter' showed that a simple baseline of calling the underlying model multiple times outperformed complex agents on HumanEval at ~50x lower cost. The common mistake is building a research-style benchmark when the actual decision is which agent to deploy. Research benchmarks need ceilings and standardization; downstream benchmarks need cost, drift resistance, and a hidden holdout. The right call is to design for the decision, not the leaderboard.
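Reporting cost alongside accuracy can be as simple as keeping only the systems that are not beaten on both axes at once. The sketch below computes that accuracy/cost Pareto frontier; the `Result` class and the numbers in the usage example are invented for illustration and are not taken from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    accuracy: float  # fraction correct on the held-out set
    cost_usd: float  # total cost of the evaluation run


def pareto_frontier(results):
    """Keep results that no other result beats on accuracy and cost together."""
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost_usd <= r.cost_usd
            and (o.accuracy > r.accuracy or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, so the deployment trade-off reads left to right.
    return sorted(frontier, key=lambda r: r.cost_usd)


# Made-up numbers, for illustration only.
results = [
    Result("single_call", accuracy=0.80, cost_usd=0.5),
    Result("retry_baseline", accuracy=0.90, cost_usd=2.0),
    Result("complex_agent", accuracy=0.88, cost_usd=100.0),
]
frontier_names = [r.name for r in pareto_frontier(results)]
```

With these invented inputs the complex agent drops off the frontier because the retry baseline is both cheaper and more accurate, which is the kind of outcome a single accuracy number hides.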
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:53:24.333769+00:00— report_created — created