Report #99565
[synthesis] High inference latency is a product failure for AI features because it turns a real-time interaction into a batch-like experience that users read as low competence
Set a product-level latency budget before selecting a model; stream tokens to improve perceived responsiveness; pre-compute or cache high-probability outputs; choose smaller models for latency-critical paths.
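A minimal sketch of two of these recommendations, assuming per-model p95 latency and a relative quality score have already been measured. The names ModelOption, pick_model, and CachedResponder are hypothetical, as is the generate callable standing in for a real model call; none of them come from the cited sources.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ModelOption:
    name: str              # hypothetical model identifier
    p95_latency_ms: float  # assumed to come from your own benchmarks
    quality: float         # relative quality score, higher is better


def pick_model(options: List[ModelOption], budget_ms: float) -> Optional[ModelOption]:
    """Return the highest-quality model whose p95 latency fits the budget."""
    fitting = [m for m in options if m.p95_latency_ms <= budget_ms]
    return max(fitting, key=lambda m: m.quality) if fitting else None


class CachedResponder:
    """Serve pre-computed answers for likely prompts; fall back to the model."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate          # stand-in for a real model call
        self._cache: Dict[str, str] = {}

    def warm(self, prompts: Iterable[str]) -> None:
        # Pre-compute outputs for high-probability prompts ahead of time.
        for prompt in prompts:
            self._cache[prompt] = self._generate(prompt)

    def respond(self, prompt: str) -> Tuple[str, bool]:
        # Returns (answer, cache_hit); misses are stored for next time.
        if prompt in self._cache:
            return self._cache[prompt], True
        answer = self._generate(prompt)
        self._cache[prompt] = answer
        return answer, False
```

With a 1000 ms budget and candidates measured at 300, 900, and 2500 ms, pick_model would return the 900 ms candidate if it has the best quality among those that fit, and None when nothing fits. The budget is fixed first and the model is chosen against it, not the other way around.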
Journey Context:
Netflix's engineering posts treat latency as a first-class product metric tied to user engagement. The LLM SE study notes that slow, unpredictable responses increase cognitive load and abandonment. The synthesis: for AI features, latency is not merely an infrastructure SLA but a UX variable, because users interpret slow responses as low competence. Streaming and caching are therefore product decisions, not just optimizations: they change how users judge the AI.
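To make the streaming point concrete, the sketch below separates time to first token (what the user perceives) from total generation time (what an infrastructure SLA usually tracks). It is an illustration under stated assumptions: fake_token_stream is a hypothetical stand-in for a real streaming model API, and the clock and sleep parameters are injectable only so the timing can be checked deterministically.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def fake_token_stream(
    tokens: Iterable[str],
    delay_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: one token per delay."""
    for token in tokens:
        sleep(delay_s)  # simulated per-token generation time
        yield token


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float, float]:
    """Consume a token stream; return (text, time_to_first_token, total_time)."""
    start = clock()
    first_token_at = None
    parts: List[str] = []
    for token in stream:
        if first_token_at is None:
            # The moment the user first sees output.
            first_token_at = clock() - start
        parts.append(token)
    total = clock() - start
    # An empty stream has no first token; report the total instead.
    ttft = first_token_at if first_token_at is not None else total
    return "".join(parts), ttft, total
```

In this sketch, a three-token response at 0.5 s per token shows output after 0.5 s when streamed, while a non-streamed response would show nothing until all 1.5 s have elapsed. The total work is identical; only the user's wait before seeing anything changes.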
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:21:23.894812+00:00 — report_created — created