Report #97541
[cost_intel] Streaming is assumed to reduce cost but only improves latency and UX
Use streaming for responsiveness, not cost reduction; enforce max_tokens and stop sequences; and measure total tokens with streaming versus non-streaming for the same prompts to confirm there is no drift toward longer outputs.
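The measurement suggested above can be sketched roughly as follows. This is a minimal illustration, assuming you already log token usage per call; `CallRecord` and `output_drift` are hypothetical names, not part of any provider SDK.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CallRecord:
    """One logged model call (hypothetical schema; adapt to your own logs)."""
    prompt_id: str       # identifies the prompt so the two modes can be paired
    streamed: bool       # True if the call used streaming
    output_tokens: int   # completion tokens reported for the call


def output_drift(records: list[CallRecord]) -> float:
    """Ratio of mean streamed output tokens to mean non-streamed output tokens.

    Only prompts logged in BOTH modes are compared, so the ratio reflects
    the mode and not a different mix of prompts.
    """
    streamed: dict[str, list[int]] = {}
    plain: dict[str, list[int]] = {}
    for r in records:
        bucket = streamed if r.streamed else plain
        bucket.setdefault(r.prompt_id, []).append(r.output_tokens)

    shared = streamed.keys() & plain.keys()
    if not shared:
        raise ValueError("no prompt was logged in both modes")

    # Average per prompt first, then across prompts, so a prompt that was
    # called many times does not dominate the comparison.
    streamed_mean = mean(mean(streamed[p]) for p in shared)
    plain_mean = mean(mean(plain[p]) for p in shared)
    if plain_mean == 0:
        raise ValueError("non-streamed calls produced no output tokens")
    return streamed_mean / plain_mean
```

A ratio close to 1.0 suggests no drift; a ratio noticeably above 1.0 is a hint that streamed calls are running longer and that max_tokens or stop handling deserves a look. What counts as "noticeably" is a judgment call for your own traffic.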
Journey Context:
Streaming is billed at the same per-token rate as non-streaming generation. Its value is a shorter time to first token and incremental display, not a lower bill. In some implementations, streaming can lead to longer outputs, or outputs that are not cut off where expected, because stop-sequence handling differs, and it does nothing to reduce input-token cost. Teams sometimes enable streaming across the board expecting savings; the real savings come from shorter prompts, smaller models, and better stop conditions.
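To make the arithmetic concrete, here is a small sketch of a per-request cost model. It assumes flat per-token pricing for input and output, as described above; the function name and the prices in the usage lines are placeholders, not any provider's actual rates.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    streamed: bool = False,
) -> float:
    """Estimated cost of one request under flat per-token pricing.

    `streamed` is accepted only to make the point explicit: it does not
    appear in the formula, so it cannot change the result.
    """
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Placeholder prices per million tokens (hypothetical, for illustration only).
IN_PRICE, OUT_PRICE = 3.00, 15.00

baseline = request_cost(2000, 500, IN_PRICE, OUT_PRICE, streamed=False)
with_streaming = request_cost(2000, 500, IN_PRICE, OUT_PRICE, streamed=True)
# Shorter prompt plus a tighter output cap: the levers that do move the bill.
trimmed = request_cost(1000, 300, IN_PRICE, OUT_PRICE, streamed=True)
```

Under these assumptions `baseline` and `with_streaming` are identical, while `trimmed` is lower because both token counts dropped. Switching to a smaller model would show up as lower price arguments in the same formula.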
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:17:54.133900+00:00— report_created — created