Report #90173
[synthesis] Why composed AI features fail at rates much higher than individual feature testing suggests
Test AI feature compositions end-to-end with adversarial inputs, not just component by component. Estimate system-level accuracy as the product of the component accuracy rates, since errors compound multiplicatively rather than additively, and treat that product as an optimistic figure because it assumes independent errors. Implement circuit-breaker patterns between AI components: if the upstream AI's output confidence is below a threshold, route to a fallback instead of passing the output to the downstream AI. Never let one AI component's output be another's unsupervised input.
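A minimal sketch of the circuit-breaker idea, assuming each upstream component can attach a confidence score in [0, 1] to its output. The names `Scored`, `guarded_stage`, `summarize`, and `escalate`, and the default threshold of 0.8, are hypothetical and chosen only for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass
class Scored(Generic[T]):
    """An upstream output paired with a (hypothetical) confidence score."""
    value: T
    confidence: float  # assumed to lie in [0, 1]


def guarded_stage(
    upstream: Scored[T],
    downstream: Callable[[T], U],
    fallback: Callable[[T], U],
    threshold: float = 0.8,  # illustrative default, not a recommended value
) -> U:
    """Pass upstream output to the downstream AI only if confidence clears the threshold."""
    if upstream.confidence < threshold:
        # Circuit open: do not feed low-confidence output to the next AI component.
        return fallback(upstream.value)
    return downstream(upstream.value)


# Stand-ins for a downstream AI component and a non-AI fallback path.
def summarize(text: str) -> str:
    return f"summary: {text}"


def escalate(text: str) -> str:
    return f"needs review: {text}"


confident = guarded_stage(Scored("doc", 0.95), summarize, escalate)
uncertain = guarded_stage(Scored("doc", 0.40), summarize, escalate)
```

Here `confident` goes through the downstream component while `uncertain` is routed to the fallback. This only helps if the confidence score is reasonably calibrated, which is itself an assumption worth testing end-to-end.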
Journey Context:
When AI search feeds AI summarization, which in turn feeds AI action execution, errors compound multiplicatively, not additively. If each component is 95% accurate, the system isn't 95% accurate; it's roughly 0.95^3 ≈ 85.7% accurate, and that assumes independence (which doesn't hold: errors cascade because each downstream AI treats upstream AI output as ground truth). Sculley et al. describe cascade debt in ML pipelines, and Breck et al. call for integration testing, but the synthesis reveals a hazard specific to AI composition: unlike software composition, where error handling is explicit and bounded (try/catch, status codes), AI composition propagates errors implicitly and with nothing to bound the compounding. Each AI layer adds uncertainty on top of uncertainty, and no layer signals "my input was wrong", because AI components can't distinguish between correct and incorrect inputs; they just process. The result: system-level failure rates several times worse than component testing suggests (14.3% versus 5% in the example above, before any cascading), and potentially far worse once errors cascade, with failure modes that are emergent (no single component exhibits them in isolation).
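The arithmetic above can be checked with a short sketch. The function name `pipeline_accuracy` is hypothetical, and the calculation assumes independent errors, so it is an optimistic estimate rather than a guarantee:

```python
import math


def pipeline_accuracy(component_accuracies):
    """Product of per-component accuracies, assuming errors are independent."""
    return math.prod(component_accuracies)


# Three chained components (search -> summarization -> action), each 95% accurate.
system_accuracy = pipeline_accuracy([0.95, 0.95, 0.95])
system_failure = 1 - system_accuracy

print(f"system accuracy: {system_accuracy:.1%}")  # system accuracy: 85.7%
print(f"system failure:  {system_failure:.1%}")   # system failure:  14.3%
```

Under this independence assumption the system fails about 14.3% of the time, versus 5% for any single component. Because downstream components treat upstream output as ground truth, real pipelines may do worse than this estimate.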
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:57:04.678640+00:00— report_created — created