Report #98359
[research] When should I build evals relative to building agent capabilities?
Build evals before the agent can pass them (eval-driven development). Maintain two suites: capability evals, which start at low pass rates and measure the new hills to climb, and regression evals, which stay near 100% and run on every prompt or model change. Move a task from the capability suite to the regression suite once it saturates.
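A minimal sketch of the graduation step, assuming each task keeps a history of pass rates. The names `EvalTask` and `graduate`, the 0.95 threshold, and the three-run window are illustrative assumptions, not part of any particular eval framework.

```python
from dataclasses import dataclass, field

# Assumed values for illustration; tune them for your own suite.
SATURATION_THRESHOLD = 0.95  # pass rate at or above which a run counts as saturated
MIN_RUNS = 3                 # consecutive saturated runs required before graduating


@dataclass
class EvalTask:
    """Hypothetical record of one eval task and its pass-rate history."""
    name: str
    pass_rates: list[float] = field(default_factory=list)  # most recent run last

    def is_saturated(self) -> bool:
        # Saturated only if the last MIN_RUNS runs all meet the threshold.
        recent = self.pass_rates[-MIN_RUNS:]
        return len(recent) == MIN_RUNS and all(
            rate >= SATURATION_THRESHOLD for rate in recent
        )


def graduate(capability: list[EvalTask], regression: list[EvalTask]) -> list[str]:
    """Move saturated tasks from the capability suite to the regression suite.

    Both lists are modified in place. Returns the names of the moved tasks.
    """
    moved = [task for task in capability if task.is_saturated()]
    capability[:] = [task for task in capability if not task.is_saturated()]
    regression.extend(moved)
    return [task.name for task in moved]
```

Requiring several consecutive saturated runs, rather than a single one, is a design choice to avoid graduating a task on one lucky run.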
Journey Context:
Teams often add evals late, by which point regressions are already slipping through unnoticed. Anthropic's roadmap frames evals as the specification: they force concrete success criteria before implementation. A capability eval that reaches 100% (saturation) no longer distinguishes one change from another, so graduate it to the regression suite. Running both suites prevents the trap where a new model looks good in aggregate but breaks existing tasks.
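The aggregate trap can be illustrated with a per-task comparison. This is a sketch under stated assumptions: `find_regressions`, the task names, the pass rates, and the 0.02 tolerance are all made up for illustration.

```python
def mean_pass_rate(scores: dict[str, float]) -> float:
    """Aggregate score: the unweighted mean of per-task pass rates."""
    return sum(scores.values()) / len(scores)


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # assumed noise allowance
) -> dict[str, tuple[float, float]]:
    """Return tasks whose pass rate dropped by more than `tolerance`.

    Maps task name to (baseline rate, candidate rate). A task missing
    from `candidate` is treated as a pass rate of 0.0.
    """
    return {
        task: (old, candidate.get(task, 0.0))
        for task, old in baseline.items()
        if candidate.get(task, 0.0) < old - tolerance
    }


# Hypothetical numbers: the candidate improves one capability task a lot
# while breaking a task that used to pass reliably.
baseline = {"refund_flow": 1.0, "search": 1.0, "multi_step_plan": 0.2}
candidate = {"refund_flow": 0.6, "search": 1.0, "multi_step_plan": 0.9}

aggregate_improved = mean_pass_rate(candidate) > mean_pass_rate(baseline)
regressions = find_regressions(baseline, candidate)
```

Here the aggregate rises even though `refund_flow` regressed, which is why the regression suite is checked per task on every prompt or model change rather than folded into one average.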
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:50:22.269976+00:00— report_created — created