Report #62008
[synthesis] Why do AI products pass all evaluation benchmarks but still get terrible user reviews?
Evaluate AI products on tail (P95/P99) performance, segmented by user cohort and query type, not just on aggregate averages. Track failure rate per user rather than failure rate per query: a user who hits 3 bad answers in a row churns even if the overall failure rate is 2%. Implement per-cohort quality dashboards that surface tail performance.
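The per-user tracking recommended above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Interaction` record, its field names, and the streak threshold of 3 are assumptions made for the example, and the interactions are assumed to arrive in chronological order for each user.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Interaction:
    # Hypothetical log record; field names are assumptions for this sketch.
    user_id: str
    cohort: str
    failed: bool  # True if the answer was judged bad


def longest_failure_streak(outcomes):
    """Length of the longest run of consecutive failures (True values)."""
    longest = current = 0
    for failed in outcomes:
        current = current + 1 if failed else 0
        longest = max(longest, current)
    return longest


def query_failure_rate(interactions):
    """Aggregate failure rate per query: the number that hides the problem."""
    return sum(it.failed for it in interactions) / len(interactions)


def per_user_summary(interactions, streak_threshold=3):
    """Failure rate and longest failure streak for each user.

    Assumes interactions are ordered chronologically per user.
    """
    by_user = defaultdict(list)
    for it in interactions:
        by_user[it.user_id].append(it.failed)

    summary = {}
    for user, outcomes in by_user.items():
        streak = longest_failure_streak(outcomes)
        summary[user] = {
            "failure_rate": sum(outcomes) / len(outcomes),
            "longest_streak": streak,
            "at_risk": streak >= streak_threshold,
        }
    return summary


# Illustrative data: 150 queries, 3 failures, all hitting the same user.
interactions = (
    [Interaction("u1", "general", False)] * 73
    + [Interaction("u2", "general", False)] * 74
    + [Interaction("u3", "legal", True)] * 3
)
overall = query_failure_rate(interactions)
at_risk_users = [u for u, s in per_user_summary(interactions).items() if s["at_risk"]]
```

In this made-up data the per-query failure rate is 2%, yet one user out of three saw nothing but failures, which is the pattern a per-query metric cannot show.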
Journey Context:
Traditional software is deterministic: for a given input it either works or it doesn't. The synthesis of three observations reveals the evaluation trap: (1) ML evaluation metrics (accuracy, F1, BLEU, etc.) optimize for aggregate or average performance across a benchmark. (2) Users experience individual interactions, not averages. (3) AI failure distributions have heavy tails: a small percentage of inputs produce catastrophically bad outputs, and those inputs cluster by user type or use case. The result: a product with 95% accuracy can be completely unusable for a specific user segment, and those users churn loudly. The aggregate metric hides the tail catastrophe. Anthropic's HHH framework evaluates helpfulness, harmlessness, and honesty as dimensions; OpenAI's eval framework measures aggregate benchmark performance. Neither alone reveals that the gap between average and tail performance, specifically the clustering of tail failures by user segment, is what kills AI products in production.
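The claim that 95% aggregate accuracy can coexist with an unusable segment is easy to check with a small worked example. The sketch below uses invented counts and hypothetical cohort names ("general", "legal"); only the arithmetic is the point.

```python
from collections import Counter


def aggregate_accuracy(records):
    """Overall accuracy across all (cohort, is_correct) records."""
    records = list(records)
    return sum(ok for _, ok in records) / len(records)


def accuracy_by_cohort(records):
    """Accuracy computed separately for each cohort label."""
    totals, correct = Counter(), Counter()
    for cohort, is_correct in records:
        totals[cohort] += 1
        correct[cohort] += int(is_correct)
    return {cohort: correct[cohort] / totals[cohort] for cohort in totals}


# Invented evaluation set: 1000 queries, 950 answered correctly overall.
records = (
    [("general", True)] * 890 + [("general", False)] * 10
    + [("legal", True)] * 60 + [("legal", False)] * 40
)

overall_accuracy = aggregate_accuracy(records)   # 950 / 1000
cohort_accuracy = accuracy_by_cohort(records)    # general: 890/900, legal: 60/100
worst_cohort = min(cohort_accuracy, key=cohort_accuracy.get)
```

Here the aggregate is 95%, the "general" cohort sits near 99%, and the "legal" cohort gets a correct answer only 60% of the time. Reporting the minimum across cohorts alongside the average is one simple way to keep that gap visible.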
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:34:02.282787+00:00— report_created — created