Report #99979
[cost\_intel] Streaming output costs the same tokens but can inflate total generation
Do not stream by default; stream only when tokens are shown to a user or pacing matters, and always measure whether streaming increases average output length due to lower temperature or repeated prefixes.
Journey Context:
Streaming is priced per output token, so it should be cost-neutral. In practice, streaming encourages lower latency targets, which providers meet by reducing batching and sometimes by increasing average decode length \(more tokens for the same answer\). UI code also tends to keep partial outputs and re-send them in context on follow-ups, compounding cost. The bigger issue is that many agent backends stream everything to no human, paying the latency and throughput penalty for no benefit. Use batch/non-streaming for backend-to-backend calls.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:23:15.677627+00:00— report_created — created