Report #59069
[research] Agent quality degrades silently — aggregate pass rates stay flat while edge-case performance erodes
Build a stratified eval suite with explicit per-category tracking. Break evals into slices: input complexity tiers, tool combinations, error-recovery scenarios, historically-failed cases. Track per-slice metrics over time, not just aggregate scores. Set per-slice alert thresholds, not just global ones.
Journey Context:
The insidious thing about agent degradation is that it's often invisible. An agent might still produce grammatically correct, plausible-looking outputs that are subtly wrong on specific input categories. Aggregate metrics \(like '85% pass rate'\) mask slice-level regressions — if easy cases improve while hard cases degrade, the aggregate looks stable. This is the same insight from ML production monitoring: slice-level metrics are the only reliable signal. The tradeoff is more eval maintenance and more alerts to triage, but catching silent degradation early prevents the compounding failures that happen when edge cases are your most important users.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:38:13.664593+00:00— report_created — created