Report #36964
[gotcha] Multimodal pre-processing creates invisible latency before streaming starts
Show a distinct 'analyzing image/document...' indicator during the pre-processing phase before text tokens begin streaming; do not rely on the streaming connection alone as a progress signal for multimodal inputs
Journey Context:
When users submit images or documents, the model must process the multimodal input before it can generate any text. This pre-processing can take 5–30+ seconds depending on input size, during which no streaming tokens are sent. The existing streaming indicator does not activate because streaming has not started yet. Users see a blank screen and assume the system is broken, often refreshing or resubmitting. The gotcha: the latency exists in a gap that standard streaming UX patterns do not cover. You need a separate loading state for the pre-processing phase that transitions to the streaming state once tokens begin flowing. Without this, multimodal inputs feel slower and buggier than text-only inputs even when total latency is comparable.
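One way to model the separate loading state is a small phase tracker that enters an "analyzing" phase on submit and switches to "streaming" only when the first token arrives. The sketch below is illustrative, not tied to any particular SDK: `Phase`, `indicator_label`, `track_phases`, and the label strings are all hypothetical names, and `tokens` stands in for whatever iterator your streaming client exposes.

```python
from enum import Enum
from typing import Iterable, Iterator, Tuple


class Phase(Enum):
    ANALYZING = "analyzing"  # request sent, no tokens yet (the pre-processing gap)
    STREAMING = "streaming"  # at least one token has arrived
    DONE = "done"            # token stream is exhausted


def indicator_label(phase: Phase, has_attachments: bool) -> str:
    """Return the status text a UI might show (labels are assumptions)."""
    if phase is Phase.ANALYZING:
        return "Analyzing image/document..." if has_attachments else "Thinking..."
    if phase is Phase.STREAMING:
        return "Generating..."
    return ""


def track_phases(
    tokens: Iterable[str], has_attachments: bool
) -> Iterator[Tuple[Phase, str, str]]:
    """Yield (phase, label, text_so_far) as the response progresses.

    ANALYZING is emitted immediately on submit, before any token is read,
    so the UI has something to render during the pre-processing gap.
    """
    text = ""
    yield Phase.ANALYZING, indicator_label(Phase.ANALYZING, has_attachments), text
    for token in tokens:  # waits here while the model pre-processes the input
        text += token
        yield Phase.STREAMING, indicator_label(Phase.STREAMING, has_attachments), text
    yield Phase.DONE, indicator_label(Phase.DONE, has_attachments), text


if __name__ == "__main__":
    # Simulated stream; a real client would pass its own token iterator.
    for phase, label, text in track_phases(iter(["Hel", "lo"]), has_attachments=True):
        print(phase.value, repr(label), repr(text))
```

The key design choice is that the first state is driven by the submit action rather than by the streaming connection, so the indicator does not depend on tokens having started to flow.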
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:31:25.833706+00:00— report_created — created