Report #75972
[synthesis] Average AI performance metrics hide a long tail of consistently terrible user experiences that don't exist in deterministic software
Report AI metrics with per-user and per-query-type breakdowns, not just aggregates. Track an experience inequality metric: the gap between the p90 and p10 per-user success rates. Set SLOs on the worst-case user experience (the p10 per-user success rate), not just the average. Investigate whether certain user segments consistently receive worse outputs because of prompt style, language, or domain expertise.
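A minimal sketch of the breakdown described above, assuming each logged interaction is a dict with hypothetical fields `user`, `query_type`, and `ok` (whether the response was judged successful). The 0.80 SLO threshold and the linear-interpolation percentile are illustrative choices, not values taken from this report.

```python
from collections import defaultdict


def success_rates_by(events, key):
    """Group events by `key` and return {group: success rate}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for event in events:
        group = event[key]
        totals[group] += 1
        successes[group] += 1 if event["ok"] else 0
    return {g: successes[g] / totals[g] for g in totals}


def percentile(values, q):
    """Percentile with linear interpolation between ranks; q is in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def experience_report(events, p10_slo=0.80):
    """Aggregate success rate plus the per-user tail that the aggregate hides."""
    events = list(events)
    rates = list(success_rates_by(events, "user").values())
    p10, p90 = percentile(rates, 10), percentile(rates, 90)
    return {
        "aggregate": sum(1 for e in events if e["ok"]) / len(events),
        "p10": p10,
        "p90": p90,
        "inequality_gap": p90 - p10,
        "meets_p10_slo": p10 >= p10_slo,  # the SLO is set on the tail, not the mean
        "by_query_type": success_rates_by(events, "query_type"),
    }


# Synthetic data: 10 users x 10 queries each. Users 0-7 always succeed,
# user 8 succeeds on 5 of 10 queries, user 9 on 2 of 10.
successes_per_user = [10] * 8 + [5, 2]
events = [
    {"user": f"u{u}", "query_type": "code" if i % 2 else "chat", "ok": i < n_ok}
    for u, n_ok in enumerate(successes_per_user)
    for i in range(10)
]
report = experience_report(events)
print(report)
```

On this synthetic data the aggregate is 87% while the p10 per-user rate is about 47%, so a dashboard showing only the aggregate would look acceptable while the p10 SLO is clearly missed.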
Journey Context:
Traditional software has low variance in user experience: a feature generally behaves the same way for every user, so if it works for one, it usually works for all. AI has high variance: the same feature can work brilliantly for one user and terribly for another, depending on their prompt, context, and sampling randomness. When you report an average (e.g., 95% helpful), you hide that some users might consistently get only 50% helpful responses. These users don't experience 95% accuracy; they experience 'this product doesn't work for me.' In traditional software, 5% bad experiences usually means the same bug affecting 5% of users. In AI, 5% bad experiences might mean 100% bad experiences for 5% of users, a fundamentally different problem requiring a fundamentally different fix. Google's ML Test Score rubric explicitly calls for slice-level evaluation (breaking performance down by subgroups), and the SRE book's SLO guidance recommends setting objectives on percentiles of the distribution, not just the mean. But many AI product dashboards report only aggregate metrics. The synthesis: combine slice-level ML evaluation with SRE SLO practices and the recognition that AI variance creates systematic experience inequality that deterministic software does not.
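The arithmetic behind the "same 95%, different problem" point can be checked with a small sketch. It assumes 100 hypothetical users who each send the same number of queries, and compares failures spread evenly across users against failures concentrated on a few users. All numbers are synthetic.

```python
def summarize(successes_per_user, queries_per_user):
    """Return (aggregate success rate, share of users at or below 50% success)."""
    total_ok = sum(successes_per_user)
    total_queries = queries_per_user * len(successes_per_user)
    rates = [ok / queries_per_user for ok in successes_per_user]
    struggling = sum(1 for r in rates if r <= 0.5) / len(rates)
    return total_ok / total_queries, struggling


QUERIES = 20  # assumed: every user sends the same number of queries

# Failures spread evenly: each of 100 users fails 1 query in 20.
spread = [19] * 100
# Failures concentrated: 95 users never fail, 5 users always fail.
concentrated = [20] * 95 + [0] * 5

spread_summary = summarize(spread, QUERIES)
concentrated_summary = summarize(concentrated, QUERIES)
print(spread_summary)        # (0.95, 0.0)
print(concentrated_summary)  # (0.95, 0.05)
```

Both populations report a 95% aggregate, but only the second contains users for whom the product never works, which is why the per-user distribution has to be reported alongside the mean.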
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:06:45.561582+00:00— report_created — created