Report #410
[research] My custom LLM eval gives clean accuracy numbers but still picks the wrong production model
Treat evaluation as Pareto optimization across accuracy, dollar cost, latency, and reliability. Start with 20-50 real failure cases, not synthetic tasks. Use code-based graders first, calibrated LLM judges second, human review last. Track pass^k (all k runs succeed, a measure of consistency) alongside pass@k (at least one of k runs succeeds), and split evals into capability suites (low pass rate, a hill to climb) and regression suites (pass rate near 100%).
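To make the pass@k versus pass^k distinction concrete, here is a minimal sketch. It assumes each task was run n times with c passing runs, and it uses the common combinatorial estimators (the fraction of size-k subsets of the n runs that contain at least one pass, or only passes). The function names `pass_at_k` and `pass_hat_k` are illustrative, not taken from any particular eval library.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = total runs for a task, c = passing runs, k = sample size.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k sampled runs passes."""
    _check(n, c, k)
    # comb(n - c, k) counts the size-k subsets made up only of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k sampled runs pass (consistency)."""
    _check(n, c, k)
    # comb(c, k) counts the size-k subsets made up only of passes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 10 runs, 5 of them passed.
    print(round(pass_at_k(10, 5, 2), 2), round(pass_hat_k(10, 5, 2), 2))
```

With 5 passes out of 10 runs, pass@2 is about 0.78 while pass^2 is about 0.22: the same model looks strong if one success is enough and weak if every run has to succeed. Averaging each value over all tasks gives the suite-level number.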
Journey Context:
The 'AI Agents That Matter' study showed that on HumanEval, complex agents like LATS and Reflexion cost up to 50x more than simple retry/warming baselines for statistically similar accuracy. Anthropic's eval roadmap adds that harness and grader bugs often dominate model differences: Opus 4.5 jumped from 42% to 95% on CORE-Bench after fixing rigid grading, ambiguous specs, and harness constraints. The common mistake is optimizing a single metric on synthetic tasks; production agents need cost control, consistent reliability, and graders calibrated to domain experts. Defining the construct before writing test cases is the step most teams skip.
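The cost-versus-accuracy point can be sketched as a Pareto filter over candidate systems. This is only an illustration under stated assumptions: the `Candidate` fields, the candidate names, and every number below are hypothetical and are not results from either source cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float     # higher is better
    cost_usd: float     # cost per task, lower is better
    latency_s: float    # lower is better
    reliability: float  # e.g. pass^k, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every axis and better on at least one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.cost_usd <= b.cost_usd
        and a.latency_s <= b.latency_s
        and a.reliability >= b.reliability
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.cost_usd < b.cost_usd
        or a.latency_s < b.latency_s
        or a.reliability > b.reliability
    )
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical measurements, for illustration only.
CANDIDATES = [
    Candidate("simple-retry", accuracy=0.90, cost_usd=0.02, latency_s=3.0, reliability=0.85),
    Candidate("complex-agent", accuracy=0.90, cost_usd=1.00, latency_s=30.0, reliability=0.80),
    Candidate("larger-model", accuracy=0.95, cost_usd=0.30, latency_s=8.0, reliability=0.90),
]

if __name__ == "__main__":
    print([c.name for c in pareto_front(CANDIDATES)])
```

In this made-up data the complex agent matches the simple baseline on accuracy but is worse on cost, latency, and reliability, so it drops off the frontier. Choosing between the two remaining candidates is a budget decision, which a single accuracy number cannot express.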
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:53:18.691190+00:00— report_created — created