Report #26571
[synthesis] Standard SRE incident playbooks fail for AI outages — why AI incidents need different runbooks than software incidents
Create AI-specific incident runbooks with two classification gates upfront: (1) Is this an operational failure (system down or returning errors) or a semantic failure (system running but producing bad outputs)? (2) What is the blast radius of incorrect outputs already served? Include procedures for: model version rollback (separate from code rollback), input distribution analysis (has the world shifted?), output quality spot-checks (manually review recent outputs), user communication templates that address potentially incorrect prior outputs (not just downtime), and post-incident review that examines training data and evaluation coverage gaps, not just code bugs.
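The two classification gates can be sketched as a small triage helper. This is a minimal illustration, not a prescribed implementation: the `Signals` fields, the threshold constants, and the function names are all hypothetical, and the quality score is assumed to come from manual spot-checks of recent outputs.

```python
from dataclasses import dataclass
from enum import Enum


class IncidentKind(Enum):
    OPERATIONAL = "operational"  # system down, erroring, or slow
    SEMANTIC = "semantic"        # system healthy but outputs are bad
    NONE = "none"


@dataclass
class Signals:
    error_rate: float      # fraction of requests returning errors
    p99_latency_ms: float  # tail latency in milliseconds
    quality_score: float   # fraction of spot-checked outputs judged acceptable


# Hypothetical thresholds (assumptions); tune them for your own service.
MAX_ERROR_RATE = 0.01
MAX_P99_LATENCY_MS = 2000.0
MIN_QUALITY_SCORE = 0.95


def classify(s: Signals) -> IncidentKind:
    """Gate 1: operational failure vs. semantic failure."""
    if s.error_rate > MAX_ERROR_RATE or s.p99_latency_ms > MAX_P99_LATENCY_MS:
        return IncidentKind.OPERATIONAL
    if s.quality_score < MIN_QUALITY_SCORE:
        return IncidentKind.SEMANTIC
    return IncidentKind.NONE


def blast_radius(served_outputs: int, bad_fraction: float) -> int:
    """Gate 2: rough count of incorrect outputs already served."""
    return round(served_outputs * bad_fraction)
```

Checking operational signals first keeps the standard playbook in charge whenever there is an ordinary error signal; the semantic branch only fires when the service looks healthy but spot-checked quality has dropped.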
Journey Context:
Standard SRE runbooks assume: identify the failing component, roll back or fix, verify with tests, close the incident. AI incidents break this playbook because the 'failing component' might be the model itself producing plausible-but-wrong outputs with no error signal: no 500s, no latency spike, no crash. Teams waste precious incident time searching for a code bug that doesn't exist, while the model continues serving bad outputs. The critical first step is classification: operational vs. semantic. Operational incidents follow standard playbooks. Semantic incidents require a different approach: stop the bleeding (circuit-break to fallback), assess blast radius (what bad outputs were consumed?), communicate differently (users need to know about potential incorrect information, not just that there was downtime), and fix the root cause (which is likely in training data, evaluation coverage, or input distribution shift, not in code).
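The "stop the bleeding" and "assess blast radius" steps for a semantic incident might look like the sketch below. It assumes a fallback path exists (for example, a previous model version or a rule-based responder); the `SemanticCircuitBreaker` class, its fields, and the in-memory request log are hypothetical and stand in for whatever routing layer and request store a real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SemanticCircuitBreaker:
    primary: Callable[[str], str]   # the model under suspicion
    fallback: Callable[[str], str]  # assumed safe path, e.g. previous model version
    tripped: bool = False
    # In-memory log of (timestamp, request_id, source); a real system
    # would persist this so the blast radius can be queried later.
    served: List[Tuple[float, str, str]] = field(default_factory=list)

    def trip(self) -> None:
        """Stop the bleeding: route all further traffic to the fallback."""
        self.tripped = True

    def handle(self, request_id: str, prompt: str, now: float) -> str:
        source = "fallback" if self.tripped else "primary"
        fn = self.fallback if self.tripped else self.primary
        self.served.append((now, request_id, source))
        return fn(prompt)

    def affected_requests(self, start: float, end: float) -> List[str]:
        """Blast radius: requests the primary model served in the suspect window."""
        return [
            rid
            for ts, rid, src in self.served
            if src == "primary" and start <= ts <= end
        ]
```

The list returned by `affected_requests` is what feeds the user communication step: it identifies who may have received incorrect output, which downtime-oriented status updates do not cover.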
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:00:05.769428+00:00— report_created — created