Report #56727
[synthesis] Why AI model fallbacks (e.g., GPT-4 to GPT-3.5) cause massive tail latency and user confusion
Route AI fallbacks to deterministic, cached, or template-based responses rather than a weaker generative model, ensuring bounded latency and consistent persona.
Journey Context:
In traditional microservices, if Service A times out, falling back to Service B usually provides a degraded but fast and predictable experience. In AI systems, if a powerful model times out and the request falls back to a weaker model, the response is not just degraded but semantically different, often causing persona inconsistency or hallucinations. Furthermore, the fallback starts only after the timeout expires, so the worst-case latency is the timeout period (e.g., 10-30 seconds) plus the fallback's own generation time, which is an unacceptable tail latency. Users often refresh or abandon the session before the fallback completes. The fix is to abandon model-to-model fallbacks and instead fall back to a deterministic UI state or a semantic cache hit, which bounds latency at roughly the timeout threshold.
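The pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `answer`, `CACHE`, `FALLBACK_TEMPLATE`, `normalize`, and the two stand-in model functions are all hypothetical names, and an exact-match dictionary stands in for a real semantic (embedding-based) cache. The only library call relied on is the standard `asyncio.wait_for`.

```python
import asyncio

# Hypothetical canned reply shown when the primary model misses its deadline.
FALLBACK_TEMPLATE = "This is taking longer than expected. Please try again shortly."

# Hypothetical cache. A real semantic cache would match on embeddings;
# an exact-match dict on a normalized prompt keeps the sketch self-contained.
CACHE: dict[str, str] = {}


def normalize(prompt: str) -> str:
    """Collapse case and whitespace so trivially different prompts share a key."""
    return " ".join(prompt.lower().split())


async def answer(prompt, call_model, timeout_s=10.0):
    """Return (source, text), where source is 'model', 'cache', or 'template'.

    On timeout, no second model is called: the fallback is a dictionary
    lookup or a constant string, so total latency stays close to timeout_s.
    """
    key = normalize(prompt)
    try:
        text = await asyncio.wait_for(call_model(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        cached = CACHE.get(key)
        if cached is not None:
            return "cache", cached
        return "template", FALLBACK_TEMPLATE
    CACHE[key] = text  # remember successful answers for future fallbacks
    return "model", text


# Stand-ins for a real model client (assumptions for illustration only).
async def fast_model(prompt):
    return "Hello from the primary model."


async def slow_model(prompt):
    await asyncio.sleep(5)  # simulates a call that exceeds the deadline
    return "too late"
```

With a model-to-model fallback, the worst case is the timeout plus the weaker model's generation time. In this sketch the fallback path does no generation, so the worst case is approximately `timeout_s` itself, and the user sees either a previously produced primary-model answer or a fixed message rather than a different model's voice.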
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:42:33.142692+00:00— report_created — created