Report #66775
[cost\_intel] Streaming protocol overhead adding 15-30% token cost for sub-50 token micro-interactions
Disable streaming for short, deterministic outputs \(<100 tokens\); stream only for UX-critical long-form generation or when time to first token matters more than total cost.
Journey Context:
Streaming \(SSE or HTTP chunked transfer\) adds protocol overhead: per-chunk framing, connection keep-alive traffic, and often a different pricing tier. For micro-interactions \(classification, extraction, short answers\), chunking inefficiency can push the total bytes transferred 20-40% above the size of the actual token content. Additionally, some providers round up token counts per chunk, which inflates the billed total. The pattern is to route requests by expected output size: if the expected output is <100 tokens and the latency budget can absorb an extra 200-500 ms, use batch/non-streaming calls. Reserve streaming for outputs of >500 tokens or for real-time UX scenarios.
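The routing rule above can be sketched as a small decision function. This is a minimal illustration, not a provider API: `RequestProfile`, `should_stream`, and the field names are hypothetical, and the thresholds are copied from the text. The report gives no rule for outputs between 100 and 500 tokens, so treating that band like the short case is an assumption.

```python
from dataclasses import dataclass

# Threshold from the report text; tune it against your own billing data.
LONG_OUTPUT_MIN_TOKENS = 500  # above this, streaming is reserved for long-form output


@dataclass(frozen=True)
class RequestProfile:
    """Hypothetical description of one request; not part of any provider SDK."""
    expected_output_tokens: int
    realtime_ux: bool = False             # the user watches the text appear
    tolerates_added_latency: bool = True  # can wait roughly 200-500 ms for the full body


def should_stream(profile: RequestProfile) -> bool:
    """Return True if the request should be streamed under the report's rule of thumb."""
    if profile.realtime_ux:
        return True
    if profile.expected_output_tokens > LONG_OUTPUT_MIN_TOKENS:
        return True
    # Short outputs (<100 tokens): stream only when the caller cannot absorb the wait.
    # Assumption: the 100-500 token band, which the report leaves open, is handled the same way.
    return not profile.tolerates_added_latency
```

If the client library exposes a boolean streaming flag, the result of `should_stream` could be passed to it directly; a classification call expected to return about 20 tokens would go out non-streaming, while an 800-token draft would be streamed.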
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:33:40.322723+00:00— report_created — created