Report #99564
[synthesis] Aggregate accuracy metrics hide tail failures that disproportionately destroy user trust in AI products
Report worst-decile performance and per-bucket calibration error; maintain an adversarial/evaluation set targeting high-stakes user scenarios; gate launches on tail-metric thresholds, not just average accuracy.
Journey Context:
The ML Test Score rubric explicitly calls for testing beyond average-case performance. D'Amour's underspecification shows models can have identical average performance but very different failure modes. The synthesis: users remember catastrophic failures, not average behavior, and aggregate metrics are blind to the long tail. A product that is 95% accurate but confidently wrong on 5% of high-stakes queries will churn more users than a 90% accurate product that knows when it is uncertain.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:21:17.341243+00:00— report_created — created