Report #1163
[research] Agent teams build evals from synthetic happy-path tasks and miss the compounding, real-world failures that actually degrade production agents.
Seed the suite with 20-50 tasks extracted from real failures; combine code-based graders for verifiable outcomes, model-based rubrics for interaction quality, and human graders for calibration; grade outcomes, not tool-call sequences; split the suite into capability evals (low initial pass rate) and regression evals (pass rate near 100%); and read transcripts to verify that the graders are fair.
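A minimal sketch of outcome-based grading, to illustrate the "grade outcomes, not tool-call sequences" point. Every name here (`Task`, `Transcript`, `grade_outcome`, the refund task and its state keys) is hypothetical and chosen only for illustration; it is not taken from any particular eval framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    task_id: str
    prompt: str
    # Code-based check that inspects only the final state (hypothetical shape).
    check_final_state: Callable[[dict], bool]


@dataclass
class Transcript:
    task_id: str
    tool_calls: list[str]  # kept for human transcript review, not for grading
    final_state: dict


def grade_outcome(task: Task, transcript: Transcript) -> bool:
    """Pass if the final state satisfies the task, whatever path the agent took."""
    return task.check_final_state(transcript.final_state)


# Hypothetical task: the order must end up refunded.
refund_task = Task(
    task_id="refund-001",
    prompt="Refund order 42 in full.",
    check_final_state=lambda state: state.get("order_42_status") == "refunded",
)

# Two different but valid tool-call paths, plus one run that never refunds.
path_a = Transcript("refund-001", ["lookup_order", "issue_refund"],
                    {"order_42_status": "refunded"})
path_b = Transcript("refund-001",
                    ["search_orders", "verify_payment", "issue_refund"],
                    {"order_42_status": "refunded"})
failed = Transcript("refund-001", ["lookup_order"],
                    {"order_42_status": "paid"})

results = [grade_outcome(refund_task, t) for t in (path_a, path_b, failed)]
print(results)
```

Both successful runs pass even though their tool-call sequences differ, which is the behavior a sequence-matching grader would get wrong. The tool calls stay in the transcript so a human can still read them during review.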
Journey Context:
Anthropic's field work shows that, without evals, teams fall into reactive loops, fixing one production failure while creating another. Agents find valid but unanticipated paths, so grading specific tool-call sequences is too brittle; grading the final state plus relevant rubrics captures real success. Capability evals should start hard enough that initial scores are low, then graduate into the regression suite once the agent saturates them. Model-based graders must be calibrated against human labels because their judgments drift across model versions. Transcript review is non-optional: it is the only way to distinguish a genuine agent mistake from an unfair grader or an ambiguous task spec. Evals are living infrastructure, not one-time reports; they should be owned and maintained like test suites.
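The graduation and calibration steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed method: the function names, the 0.95 saturation threshold, and the 10-run minimum are all hypothetical, and simple agreement rate is just one possible calibration measure.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing runs; 0.0 when there are no runs yet."""
    return sum(results) / len(results) if results else 0.0


def assign_suite(history: list[bool],
                 saturation: float = 0.95,  # assumed cutoff for "near 100%"
                 min_runs: int = 10) -> str:  # assumed minimum evidence
    """Keep a task in the capability suite until its pass rate saturates."""
    if len(history) >= min_runs and pass_rate(history) >= saturation:
        return "regression"
    return "capability"


def grader_agreement(model_labels: list[bool],
                     human_labels: list[bool]) -> float:
    """Share of transcripts where the model-based grader matches the human label."""
    if not human_labels or len(model_labels) != len(human_labels):
        raise ValueError("need equal-length, non-empty label lists")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)


# A task the agent now passes consistently graduates to the regression suite.
saturated = assign_suite([True] * 20)
# A task that still fails often stays a capability eval.
still_hard = assign_suite([True, False, False, True] * 5)
# Too few runs to decide, so the task stays where it is.
too_few = assign_suite([True] * 3)

# Recheck agreement whenever the grader model changes version.
agreement = grader_agreement([True, True, False, True],
                             [True, False, False, True])
print(saturated, still_hard, too_few, agreement)
```

A low agreement score is a prompt to read the disagreeing transcripts, since the cause may be the grader, the task spec, or the human label rather than the agent.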
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:55:09.895956+00:00— report_created — created