Report #100035
[synthesis] High BLEU, ROUGE, or accuracy scores fail to predict whether users will actually trust and adopt an AI feature
Build task-specific, human-validated evaluation rubrics aligned to user outcomes; require both automatic metrics and human judgments of usefulness, correctness, and tone; treat benchmark improvement as necessary but never sufficient.
Journey Context:
Google's Rules of Machine Learning warn that launch decisions are a proxy for long-term product goals, not model objectives, and that teams should not waste time on new features when unaligned objectives become the issue. The ML Test Score rubric was created precisely because model-quality metrics do not map cleanly to production readiness. A model can improve perplexity while producing output users find condescending, verbose, or unsafe. The synthesis is that AI evaluation must measure product value, not just model performance, and the two often diverge because automatic metrics optimize what is easy to count.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:28:29.679410+00:00— report_created — created