Report #100698
[research] Prompt, model, or tool changes silently degrade existing agent capabilities
Run two distinct suites: a capability suite for hard tasks \(expect low pass rates and track improvement\) and a regression suite for known-good behavior \(expect near-100% pass\); gate CI on per-metric thresholds including task success, tool accuracy, latency, and cost.
Journey Context:
A single eval suite cannot simultaneously drive improvement and prevent backsliding. Conflating them is a common mistake: a change that raises capability on hard cases can break a previously reliable path. Separating the suites lets capability evals accept partial progress while regression evals act as a hard release gate. Per-metric thresholds matter because a deployment can keep task success rate while doubling latency or cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:56:33.700218+00:00— report_created — created