Report #39904
[synthesis] AI failures are silent and plausible — operational monitoring misses them entirely
Implement a semantic SLI layer that monitors output correctness via LLM-as-judge or embedding-based drift detection alongside traditional operational SLIs (latency, error rate, uptime). Treat semantic SLI breaches as P1 incidents.
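As a rough illustration of the embedding-based option, the sketch below compares the centroid of a window of recent output embeddings against a baseline centroid and reports a breach next to an operational SLI. It is only one simple drift measure, not a prescribed method. The embeddings are assumed to be precomputed by whatever embedding model is in use, and the names `drift_score` and `check_slis` and both thresholds are hypothetical placeholders that would need tuning on real traffic.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def drift_score(baseline: Sequence[Vector], window: Sequence[Vector]) -> float:
    """Distance between the baseline centroid and the recent-window centroid."""
    return cosine_distance(centroid(baseline), centroid(window))


def check_slis(
    error_rate: float,
    drift: float,
    max_error_rate: float = 0.01,  # hypothetical operational threshold
    max_drift: float = 0.15,       # hypothetical semantic threshold
) -> List[str]:
    """Return the names of breached SLIs; the two planes are checked independently."""
    breaches = []
    if error_rate > max_error_rate:
        breaches.append("error_rate")
    if drift > max_drift:
        # Per the recommendation above, this breach would be paged as a P1.
        breaches.append("semantic_drift")
    return breaches
```

A request stream with a 0% error rate can still return `["semantic_drift"]` here, which is the case operational dashboards alone would miss.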
Journey Context:
Traditional observability assumes failures are noisy: stack traces, 500s, crashes. AI products fail silently: outputs are grammatically fluent, structurally valid, and operationally 'successful' but semantically wrong. Teams that monitor only operational metrics discover AI failures via social media or support escalations, not dashboards. The synthesis: you need two independent monitoring planes. Operational monitoring catches infrastructure failures; semantic monitoring catches AI-specific failures where the system 'works' but is wrong. LangSmith tracing and OpenAI evals each address part of this, but neither alone creates the operational discipline of treating semantic degradation as an incident. The key tradeoff is cost: semantic monitoring requires running secondary model calls or embedding computations on production traffic, which can double inference spend when every request is scored by a model of comparable cost. This is worth it because the alternative is discovering hallucinations via user churn.
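To make the cost tradeoff concrete, the sketch below works through the arithmetic and shows one common way to bound it: scoring only a deterministic sample of requests. Sampling is an assumption added here for illustration, not something the report prescribes, and `should_judge`, `spend_multiplier`, and the example rates are hypothetical.

```python
import hashlib


def should_judge(request_id: str, sample_rate: float) -> bool:
    """Deterministically pick roughly `sample_rate` of requests for semantic scoring.

    Hashing the request ID (rather than calling a random generator) means the
    same request always gets the same decision, which keeps replays consistent.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def spend_multiplier(sample_rate: float, judge_cost_ratio: float) -> float:
    """Total inference spend relative to running the primary model alone.

    judge_cost_ratio is the cost of one semantic check divided by the cost of
    one primary call (1.0 means the judge costs as much as the primary model).
    """
    return 1.0 + sample_rate * judge_cost_ratio


# Judging every request with an equally expensive model doubles spend.
full_coverage = spend_multiplier(sample_rate=1.0, judge_cost_ratio=1.0)

# Judging a 10% sample with the same model adds about 10% instead.
sampled = spend_multiplier(sample_rate=0.1, judge_cost_ratio=1.0)
```

A lower sample rate reduces spend but also delays detection, since fewer scored outputs arrive per time window; the right rate depends on traffic volume and how quickly a breach needs to be noticed.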
Lifecycle
2026-06-18T21:26:54.933929+00:00— report_created — created