Report #68552
[counterintuitive] Does API streaming reduce backend time to first token?
Use streaming to improve perceived latency in the UI, but do not expect it to reduce backend Time to First Token (TTFT) or total generation time; it adds slight overhead and complicates output parsing.
Journey Context:
Developers often set `stream=True` expecting the model to compute faster. It does not: the model generates tokens at the same speed (or marginally slower, given chunking and server-sent events (SSE) overhead). Streaming only changes delivery, sending tokens over the network as they are generated instead of holding them until the completion is finished. The end user sees the first token sooner, so perceived latency improves, but backend TTFT and total compute time are unchanged. Streaming also makes structured output such as JSON harder to handle, because the text received mid-stream is usually not a valid document until the stream ends.
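The following is a minimal simulation of that behavior, not a call to any real API. The names `generate`, `non_streaming`, `streaming`, `CHUNKS`, and `SECONDS_PER_CHUNK` are hypothetical, and the timings are made-up figures tracked with a counter rather than a real clock. It sketches two things: the backend produces every chunk at the same moment in both modes, and a streamed JSON payload only parses once the buffer is complete.

```python
import json
from typing import Dict, Iterator, Tuple

# Hypothetical stand-in for a model backend. Every chunk costs the same
# simulated compute time whether or not the response is streamed.
CHUNKS = ['{"city": ', '"Paris", ', '"temp_c": ', '21}']
SECONDS_PER_CHUNK = 0.5  # assumed figure, for illustration only


def generate() -> Iterator[Tuple[float, str]]:
    """Yield (simulated_elapsed_seconds, chunk) as each chunk is 'computed'."""
    elapsed = 0.0
    for chunk in CHUNKS:
        elapsed += SECONDS_PER_CHUNK
        yield elapsed, chunk


def non_streaming() -> Dict[str, object]:
    """The client sees nothing until the whole completion is ready."""
    elapsed, text = 0.0, ""
    for elapsed, chunk in generate():
        text += chunk
    return {
        "first_visible_at": elapsed,  # first token shows up with the last one
        "done_at": elapsed,
        "data": json.loads(text),
    }


def streaming() -> Dict[str, object]:
    """The client sees each chunk on arrival but must buffer to parse JSON."""
    first_visible_at = None
    buffer, failed_parses, elapsed = "", 0, 0.0
    for elapsed, chunk in generate():
        if first_visible_at is None:
            first_visible_at = elapsed  # user-visible latency improves here
        buffer += chunk
        try:
            json.loads(buffer)
        except json.JSONDecodeError:
            failed_parses += 1  # the partial buffer is not valid JSON yet
    return {
        "first_visible_at": first_visible_at,
        "done_at": elapsed,  # same as the non-streaming case
        "failed_parses": failed_parses,
        "data": json.loads(buffer),
    }


if __name__ == "__main__":
    print("non-streaming:", non_streaming())
    print("streaming:    ", streaming())
```

In this sketch both modes finish at the same simulated time and the first chunk is computed at the same moment; only the point at which the client can see it differs. A real client would typically buffer the stream and parse once at the end, or use an incremental parser, rather than retrying `json.loads` on every chunk.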
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:33:08.227037+00:00— report_created — created