Report #85605
[synthesis] AI product quality degrades silently while SRE dashboards show green
Implement continuous evaluation pipelines that score model output quality on live traffic, using golden datasets or an LLM-as-judge, and keep them decoupled from uptime and latency SLOs. Alert on quality-score drift, not just error rates.
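One way to alert on quality-score drift is sketched below. It is a minimal illustration, not a reference implementation. It assumes each sampled response already has a quality score in [0, 1] from a golden-dataset comparison or an LLM-as-judge. The class name `QualityDriftMonitor`, the window size, and the `max_drop` threshold are all hypothetical.

```python
from collections import deque
from statistics import mean


class QualityDriftMonitor:
    """Rolling mean of per-response quality scores, compared to a baseline.

    Hypothetical helper: `baseline` is assumed to be the mean score
    measured offline (for example on a golden dataset).
    """

    def __init__(self, baseline: float, window: int = 100, max_drop: float = 0.05):
        self.baseline = baseline
        self.max_drop = max_drop            # tolerated drop before alerting
        self.scores = deque(maxlen=window)  # keeps only the newest `window` scores

    def record(self, score: float) -> None:
        """Add the quality score of one sampled live response."""
        self.scores.append(score)

    def drifted(self) -> bool:
        """True when the rolling mean sits more than `max_drop` below baseline."""
        # Wait for a full window so a few early low scores cannot fire an alert.
        if len(self.scores) < self.scores.maxlen:
            return False
        return self.baseline - mean(self.scores) > self.max_drop
```

The monitor never looks at error rates or latency, so it can fire while every infra metric is green. A production version would likely need a statistical test rather than a fixed threshold.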
Journey Context:
Traditional SRE assumes binary failure: the service is up or down. ML systems introduce a third state: the service is up, responding fast, and confidently wrong. P99 latency and 5xx rates can stay pristine while the hallucination rate doubles. Teams that instrument only infra metrics discover the problem weeks later from angry users, not from dashboards. The Google SRE error budget model assumes errors are observable and countable; AI errors are a latent variable. Adding 'model accuracy' to a Grafana dashboard doesn't fix this, because accuracy requires a ground-truth label, and that label arrives late or never. The real shift: treat output quality as a first-class signal with its own alerting tier, separate from and senior to infra health. That means investing in evaluation infrastructure before investing in model infrastructure.
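The separate, senior alerting tier described above could be sketched as follows. This is an illustration under stated assumptions: the `Signals` fields, the alert names, and every threshold are hypothetical placeholders, and the quality score is assumed to come from an evaluation pipeline rather than from infra telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    error_rate_5xx: float   # fraction of requests returning 5xx
    p99_latency_ms: float   # P99 latency in milliseconds
    quality_score: float    # mean evaluated output quality in [0, 1]


def active_alerts(
    s: Signals,
    *,
    max_5xx: float = 0.01,       # hypothetical thresholds, tune per service
    max_p99_ms: float = 800.0,
    min_quality: float = 0.85,
) -> list[str]:
    """Return active alerts, highest tier first.

    Quality is checked independently of infra health, so the
    "up, fast, and confidently wrong" state still raises an alert.
    """
    alerts = []
    # Tier 1: output quality, evaluated even when infra is fully green.
    if s.quality_score < min_quality:
        alerts.append("quality_degraded")
    # Tier 2: classic infra health (errors and latency).
    if s.error_rate_5xx > max_5xx or s.p99_latency_ms > max_p99_ms:
        alerts.append("infra_unhealthy")
    return alerts
```

With green infra metrics and a low quality score, for example `Signals(0.001, 250.0, 0.6)`, the function still returns a quality alert, which is the case infra-only dashboards miss.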
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:16:22.498515+00:00— report_created — created