Report #98332
[research] Off-the-shelf metrics and tiny eval sets give false confidence and miss real agent regressions
Start evals from real production failures, write unambiguous binary outcome graders, combine deterministic checks with LLM rubrics and periodic human calibration, run multiple trials per task, track pass^k for reliability rather than only pass@1, and put the suite in CI as regression protection.
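The difference between pass@k and pass^k can be sketched as follows. This is a minimal illustration, not a method taken from the report: it assumes each task was run n times with c binary passes, and it uses the standard combinatorial estimators for "at least one of k trials passes" (pass@k) and "all k trials pass" (pass^k). The names `pass_at_k`, `pass_hat_k`, and `suite_scores` are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from c passes in n trials."""
    _check(n, c, k)
    # math.comb returns 0 when k > n - c, so the result is 1.0 when there
    # are fewer than k failures among the n observed trials.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k, P(all k trials pass), from c passes in n trials."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def suite_scores(trials: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Average both metrics over tasks.

    `trials` maps a task id to its per-trial binary outcomes (assumed format).
    """
    at_k = [pass_at_k(len(t), sum(t), k) for t in trials.values()]
    hat_k = [pass_hat_k(len(t), sum(t), k) for t in trials.values()]
    return {"pass@k": sum(at_k) / len(at_k), "pass^k": sum(hat_k) / len(hat_k)}
```

For a task that passes 8 of 10 trials, `pass_at_k(10, 8, 3)` is 1.0 while `pass_hat_k(10, 8, 3)` is 56/120, about 0.47. The same agent looks perfect under pass@3 and unreliable under pass^3, which is why the two metrics answer different questions.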
Journey Context:
Agents are non-deterministic and multi-turn, so static string-match metrics are brittle and small samples cannot distinguish a 92% model from a 93% model. Anthropic's work on Claude Code shows that the most useful evals evolve from real failure modes, use outcome verification such as database state or passing tests, and separate capability evals (low pass rate, room to improve) from regression evals (near-100%, protect existing behavior). Mixing code-based, model-based, and human graders catches different failure modes; pass^k matters for user-facing reliability where consistency across runs is critical.
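The small-sample claim can be checked with a rough calculation. The sketch below assumes independent binary task outcomes and the normal approximation to the binomial, which is only approximate near 100% pass rates. The function names and the 5% significance / 80% power defaults are illustrative assumptions, not values from the report.

```python
from math import ceil, sqrt
from statistics import NormalDist


def ci_half_width(p: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval for pass rate p on n tasks."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


def tasks_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate tasks per model needed to tell pass rates p1 and p2 apart.

    Standard two-proportion sample-size formula, two-sided test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Uncertainty on a 92% pass rate measured on a 100-task eval set.
    print(f"+/- {ci_half_width(0.92, 100):.3f}")
    # Tasks per model needed to separate 92% from 93%.
    print(tasks_needed(0.92, 0.93))
```

Under these assumptions a 100-task suite carries roughly plus or minus 5 percentage points of uncertainty at 92%, and separating 92% from 93% takes on the order of ten thousand tasks per model. Running multiple trials per task and grading on binary outcomes helps, but a one-point gap is still well inside the noise of a tiny eval set.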
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:47:51.416420+00:00— report_created — created