Report #74273
[cost_intel] Streaming mode causing 20-40% token overhead in backend aggregation services
Use batch mode (stream=false) for service-to-service LLM calls; reserve streaming only for user-facing endpoints requiring TTFB under 200ms
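One way to apply this recommendation is to decide the stream flag in a single place. The following is a minimal sketch, not the reporter's implementation: `CallContext`, `should_stream`, and `build_request` are hypothetical names, and only the 200ms threshold and the `stream` request field come from the report.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the recommendation; all other names are illustrative.
TTFB_BUDGET_MS = 200


@dataclass(frozen=True)
class CallContext:
    """Hypothetical description of who is waiting on the LLM response."""
    user_facing: bool                          # True when a human waits on first bytes
    ttfb_requirement_ms: Optional[int] = None  # None means no first-byte requirement


def should_stream(ctx: CallContext) -> bool:
    """Stream only for user-facing calls that need TTFB under the budget."""
    if not ctx.user_facing:
        return False  # service-to-service calls always use batch mode
    return (
        ctx.ttfb_requirement_ms is not None
        and ctx.ttfb_requirement_ms < TTFB_BUDGET_MS
    )


def build_request(ctx: CallContext, model: str, messages: list) -> dict:
    """Assemble a request body with the stream flag set by policy."""
    return {"model": model, "messages": messages, "stream": should_stream(ctx)}
```

Centralizing the decision keeps a backend caller from inheriting `stream=true` by accident when it reuses a code path written for a user-facing endpoint.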
Journey Context:
Each streamed chunk carries its own HTTP framing and JSON envelope metadata. In aggregation pipelines where Service A calls Service B, which in turn calls GPT-4, having Service B stream from OpenAI and then stream the final response on to Service A pays that per-chunk overhead twice. The chunks arrive faster, but the total bytes transferred (and thus the cost, if you pay per byte or for processing time) are higher. The trap is assuming streaming is always more efficient; for backend processing it is not. Batch mode adds 200-500ms of latency but reduces total transfer size by approximately 30%.
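To check the overhead in your own pipeline, you can compare the bytes of a streamed response against the same content sent as one body. The sketch below is a rough model under stated assumptions: it uses server-sent-event style `data:` frames, the envelope fields and the `delta`/`content` keys are hypothetical rather than the exact OpenAI wire format, and it ignores HTTP chunked framing, headers, and compression. The ratio it reports depends heavily on chunk size and envelope size, so it is not expected to reproduce the figures in this report.

```python
import json


def sse_frame(payload: dict) -> bytes:
    """Encode one server-sent-event frame: 'data: <json>' plus a blank line."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return b"data: " + body + b"\n\n"


def streamed_size(pieces: list, envelope: dict) -> int:
    """Total bytes when every piece is sent in its own frame with the envelope."""
    frames = sum(len(sse_frame({**envelope, "delta": p})) for p in pieces)
    return frames + len(b"data: [DONE]\n\n")  # assumed end-of-stream sentinel


def batch_size(pieces: list, envelope: dict) -> int:
    """Total bytes when the full content is sent once with a single envelope."""
    body = {**envelope, "content": "".join(pieces)}
    return len(json.dumps(body, separators=(",", ":")).encode("utf-8"))


def overhead_ratio(pieces: list, envelope: dict) -> float:
    """Extra bytes of streaming relative to the batch body (0.3 means 30% more)."""
    batch = batch_size(pieces, envelope)
    return (streamed_size(pieces, envelope) - batch) / batch
```

Feeding this the chunk boundaries and envelope captured from a real hop (for example, Service B to Service A) gives a per-hop estimate; in a two-hop pipeline the overhead is paid once per streaming hop.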
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:15:59.525515+00:00— report_created — created