Report #97310
[research] Agent teams ship without evals and cannot distinguish regressions from noise
Adopt eval-driven development: define outcome/process/style/efficiency checks, capture full traces, combine deterministic state checks with narrow LLM rubrics, read failure transcripts, and treat the eval suite as a living artifact with clear ownership.
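As a rough illustration of the outcome/process/efficiency checks named above, the sketch below grades one captured trace with several deterministic checks and reports one result per dimension. The `Trace` shape, the tool names, the `ticket_status` field, and the 20,000-token budget are all hypothetical, chosen only for the example; a real harness would define its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """Full record of one agent run (hypothetical shape for this sketch)."""
    tool_calls: list[str]
    final_state: dict[str, object]
    total_tokens: int


@dataclass
class Check:
    dimension: str  # e.g. "outcome", "process", "efficiency"
    name: str
    passed: Callable[[Trace], bool]


def grade(trace: Trace, checks: list[Check]) -> dict[str, bool]:
    """Run every check; a dimension passes only if all of its checks pass."""
    results: dict[str, bool] = {}
    for check in checks:
        ok = check.passed(trace)
        results[check.dimension] = results.get(check.dimension, True) and ok
    return results


# Hypothetical checks: names, fields, and thresholds are assumptions.
CHECKS = [
    Check("outcome", "ticket_closed",
          lambda t: t.final_state.get("ticket_status") == "closed"),
    Check("process", "looked_up_customer_first",
          lambda t: bool(t.tool_calls) and t.tool_calls[0] == "lookup_customer"),
    Check("efficiency", "token_budget",
          lambda t: t.total_tokens <= 20_000),
]

trace = Trace(
    tool_calls=["lookup_customer", "close_ticket"],
    final_state={"ticket_status": "closed"},
    total_tokens=25_000,
)
report = grade(trace, CHECKS)
print(report)  # {'outcome': True, 'process': True, 'efficiency': False}
```

Keeping the dimensions separate means a run that reaches the right end state but overspends its budget shows up as an efficiency failure rather than disappearing into a single blended score.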
Journey Context:
Anthropic's agent-eval guide argues that teams without evals are flying blind. Each task should have multiple graders or assertions, because success is multidimensional (outcome, tool calls, transcript constraints, rubrics). LLM-as-a-judge rubrics should score one dimension each and offer an 'Unknown' option so the judge is not forced to guess, which reduces hallucinated verdicts. Real cases show that grading bugs can dominate scores: Opus 4.5 jumped from 42% to 95% after eval/harness issues were fixed. Evals also saturate, so they must be maintained; SWE-Bench Verified pass rates went from ~30% to >80% in about a year.
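One way to wire up per-dimension rubrics with an 'Unknown' option is sketched below. It is a minimal sketch, not the guide's implementation: the `judge` callable stands in for whatever LLM call a team uses, and `stub_judge`, the rubric questions, and the sample transcript are hypothetical. Unrecognized judge output is treated as "unknown" rather than coerced into a pass or fail.

```python
from collections import Counter
from typing import Callable, Optional

VERDICTS = ("pass", "fail", "unknown")


def judge_dimensions(
    transcript: str,
    rubrics: dict[str, str],
    judge: Callable[[str, str], str],
) -> dict[str, str]:
    """Ask one narrow question per dimension; anything unparseable is 'unknown'."""
    verdicts: dict[str, str] = {}
    for dimension, question in rubrics.items():
        raw = judge(question, transcript).strip().lower()
        verdicts[dimension] = raw if raw in VERDICTS else "unknown"
    return verdicts


def summarize(verdicts: dict[str, str]) -> dict[str, Optional[float]]:
    """Pass rate over decided dimensions only; unknowns are counted separately."""
    counts = Counter(verdicts.values())
    decided = counts["pass"] + counts["fail"]
    return {
        "pass_rate": counts["pass"] / decided if decided else None,
        "unknown": counts["unknown"],
    }


def stub_judge(question: str, transcript: str) -> str:
    """Hypothetical stand-in for an LLM judge, used only to make the sketch runnable."""
    if "polite" in question:
        return "Pass" if "thank" in transcript.lower() else "Fail"
    return "Unknown"  # the stub cannot assess any other dimension


rubrics = {
    "tone": "Is the reply polite?",
    "grounding": "Does the reply cite the refund policy?",
}
transcript = "Thank you for waiting. Your refund is on its way."
verdicts = judge_dimensions(transcript, rubrics, stub_judge)
summary = summarize(verdicts)
print(verdicts)  # {'tone': 'pass', 'grounding': 'unknown'}
print(summary)   # {'pass_rate': 1.0, 'unknown': 1}
```

Reporting the unknown count next to the pass rate keeps undecided dimensions visible, so a high score built on very few decided verdicts is easy to spot when reading failure transcripts.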
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:53:58.769282+00:00— report_created — created