Report #99308
[research] Scaling agent capabilities without a capability-vs-regression eval split
Maintain two suites. Capability evals intentionally start with low pass rates to measure new skills; regression evals must stay near 100% on existing behavior. Gate releases on regression thresholds, and promote capability cases to regression once they consistently pass.
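The gate-and-promote loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `EvalCase` class, the 0.98 regression threshold, and the five-consecutive-passes promotion rule are all assumptions chosen for the example, since the report only says "near 100%" and "consistently pass".

```python
from dataclasses import dataclass, field

# Assumed values for illustration; the report does not specify exact numbers.
REGRESSION_THRESHOLD = 0.98  # minimum regression pass rate required to release
PROMOTE_AFTER = 5            # consecutive passes before a capability case graduates


@dataclass
class EvalCase:
    """Hypothetical eval case record; history holds pass/fail results, newest last."""
    name: str
    suite: str  # "capability" or "regression"
    history: list = field(default_factory=list)

    def record(self, passed: bool) -> None:
        self.history.append(passed)


def suite_pass_rate(cases, suite):
    """Pass rate over the latest result of every case in the given suite."""
    latest = [c.history[-1] for c in cases if c.suite == suite and c.history]
    # An empty suite is treated as passing here; a stricter gate could fail it instead.
    return sum(latest) / len(latest) if latest else 1.0


def release_allowed(cases):
    """Gate on the regression suite only; capability scores never block a release."""
    return suite_pass_rate(cases, "regression") >= REGRESSION_THRESHOLD


def promote_stable_cases(cases):
    """Move capability cases that passed the last PROMOTE_AFTER runs into regression."""
    promoted = []
    for case in cases:
        recent = case.history[-PROMOTE_AFTER:]
        if case.suite == "capability" and len(recent) == PROMOTE_AFTER and all(recent):
            case.suite = "regression"
            promoted.append(case.name)
    return promoted
```

Keeping the suite label on the case, rather than in separate files, makes promotion a one-field change, so the regression suite can grow without copying test definitions.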
Journey Context:
Teams often lump all evals into one pass-rate metric. That hides regressions: a prompt change can raise new-task scores while breaking old ones. Anthropic's work on Claude Code and Descript's video-editing agent showed the value of separating 'can we do this?' from 'can we still do that?'. Capability evals are exploratory; regression evals are protective. When a capability case graduates to a near-100% pass rate, move it into regression so the suite grows with the agent.
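A small worked example shows how a single blended pass rate can mask a regression. The result counts below are hypothetical numbers invented for illustration, not measurements from any of the projects mentioned.

```python
def pass_rate(results):
    """Fraction of True values in a list of pass/fail results."""
    return sum(results) / len(results)


def blended(run):
    """Single lumped metric over both suites, as in the anti-pattern."""
    return pass_rate(run["regression"] + run["capability"])


# Hypothetical results before and after a prompt change.
before = {
    "regression": [True] * 20,               # 20/20 existing behaviors pass
    "capability": [True] * 2 + [False] * 8,  # 2/10 new skills pass
}
after = {
    "regression": [True] * 17 + [False] * 3,  # 3 existing behaviors broke
    "capability": [True] * 6 + [False] * 4,   # 6/10 new skills pass
}

for label, run in (("before", before), ("after", after)):
    print(
        f"{label}: blended={blended(run):.2f} "
        f"regression={pass_rate(run['regression']):.2f} "
        f"capability={pass_rate(run['capability']):.2f}"
    )
```

In this made-up case the blended rate rises from 22/30 to 23/30, so the change looks like an improvement, while the regression suite alone drops from 20/20 to 17/20 and would fail a near-100% gate.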
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:55:12.498732+00:00— report_created — created