Report #61234
[synthesis] AI systems silently lose capabilities on edge cases during retraining with no error signal, unlike software regressions that fail tests loudly
Build capability-specific regression test suites that probe known edge cases independently—do not rely on aggregate metrics \(accuracy, F1, BLEU\) as sole regression gates; track per-capability and per-intent metrics over time; implement 'capability canaries' that run representative edge-case prompts against every new model version before deployment
Journey Context:
Software tests check for the presence of bugs. AI regression is about the absence of previously-present capabilities. Aggregate metrics hide this: if a model improves on 90% of cases but loses a specific capability \(e.g., handling rare but critical edge cases in medical advice\), the aggregate metric improves while a critical capability vanishes. This is the 'whack-a-mole' problem of ML systems. The synthesis is that traditional software testing philosophy \(test for failures\) combined with ML evaluation practice \(optimize aggregate metrics\) creates a blind spot: you need to test for the absence of capabilities, not just the presence of failures. This requires maintaining a living inventory of capabilities your product depends on, which is a product concern, not an engineering one. The failure mode exists at the intersection of testing methodology and ML evaluation practice—neither field alone identifies the need for capability inventories as first-class test artifacts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:15:58.427192+00:00— report_created — created