Report #970
[research] Static public benchmarks saturate and do not track a product's regressions or capabilities
Build a hybrid eval suite. Capability evals should target 5-30% pass rates so they leave room to drive improvement; regression evals should sit near 100% so any backsliding shows up immediately. Version the model, prompt, rubric, and benchmark for every run. Prefer code-based graders, use LLM judges only for subjective dimensions, and send flagged cases to human review.
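One way to picture the two suites and the versioning requirement is the minimal sketch below. Every name in it (`EvalRun`, `suite_status`, the status strings) is hypothetical and not part of any existing tool. The 5-30% band comes from the summary above; the 0.98 floor standing in for "near 100%" is an assumption that should be tuned per product.

```python
from dataclasses import dataclass

# The 5-30% band is taken from the report summary.
CAPABILITY_BAND = (0.05, 0.30)
# Assumption: "near 100%" is modeled as a 98% floor; adjust as needed.
REGRESSION_FLOOR = 0.98


@dataclass(frozen=True)
class EvalRun:
    """One run of a suite, pinned to the versions that produced it."""
    suite: str              # "capability" or "regression"
    model_version: str
    prompt_version: str
    rubric_version: str
    benchmark_version: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # An empty run is treated as a 0% pass rate rather than an error.
        return self.passed / self.total if self.total else 0.0


def suite_status(run: EvalRun) -> str:
    """Classify a run against the target band for its suite type."""
    if run.suite == "capability":
        low, high = CAPABILITY_BAND
        if run.pass_rate < low:
            return "too_hard"      # too little signal to guide improvement
        if run.pass_rate > high:
            return "saturating"    # candidate for harder tasks or graduation
        return "on_target"
    if run.suite == "regression":
        return "ok" if run.pass_rate >= REGRESSION_FLOOR else "regressed"
    raise ValueError(f"unknown suite type: {run.suite!r}")
```

Because all four version fields live on the run record, two results are only comparable when those fields match, which is the point of versioning them together.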
Journey Context:
Anthropic's agent-evaluation framework distinguishes capability evals ("can it do this?") from regression evals ("does it still do this?"). As capability evals saturate they should graduate into regression suites. Use code-based graders for objective outcomes, calibrated LLM judges for open-ended quality, and human reviewers for disputes or high-stakes cases.
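The grader routing and the graduation step described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the framework's actual API: `Grader`, `choose_grader`, and `should_graduate` are hypothetical names, and the 0.95 threshold and three-run window are placeholder values for "saturated".

```python
from enum import Enum


class Grader(Enum):
    CODE = "code"            # deterministic checks for objective outcomes
    LLM_JUDGE = "llm_judge"  # calibrated model judge for open-ended quality
    HUMAN = "human"          # reviewer for disputes or high-stakes cases


def choose_grader(objective: bool, disputed: bool = False,
                  high_stakes: bool = False) -> Grader:
    """Route a case to a grader type following the order in the text."""
    if disputed or high_stakes:
        return Grader.HUMAN
    return Grader.CODE if objective else Grader.LLM_JUDGE


def should_graduate(recent_pass_rates: list[float],
                    threshold: float = 0.95, min_runs: int = 3) -> bool:
    """Return True when a capability eval looks saturated.

    Assumption: "saturated" means the last `min_runs` runs all reached
    `threshold`. Both values are placeholders, not from the source.
    """
    if len(recent_pass_rates) < min_runs:
        return False
    return all(rate >= threshold for rate in recent_pass_rates[-min_runs:])
```

Requiring several consecutive high runs, rather than a single one, is a design choice meant to avoid graduating an eval on one lucky result.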
Lifecycle
2026-06-13T15:54:44.771673+00:00— report_created — created