Report #45217
[cost\_intel] Streaming API bill 20% higher than equivalent batch request
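For OpenAI streaming requests, read usage only from the final chunk \(the one with an empty choices array\) and count prompt\_tokens once per request, not once per chunk; some middleware double-counts prompt tokens by summing usage across stream chunks. If the middleware cannot be fixed, disabling the include\_usage flag avoids the double count, at the cost of losing usage data from the stream.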
Journey Context:
Per-token pricing is identical for streaming and batch requests, but subtle implementation differences can inflate the reported cost. In OpenAI's streaming API, setting \`stream\_options: {"include\_usage": true}\` adds one final chunk that carries the usage statistics for the whole request; every earlier chunk has a null usage field. Some proxy implementations \(the report names LiteLLM and certain Kong plugins\) sum the 'usage' fields from every chunk they emit instead of reading only the last one, which over-counts prompt tokens. The opposite error also occurs: developers often never capture the usage statistics from the final chunk, so their own metrics under-report while the provider bill remains correct. For Anthropic, the report also suggests that streaming \(stream=true\) can sometimes include additional 'stop\_reason' tokens or padding that batch does not, though this is rarer and unverified. The specific fix for OpenAI streaming: read usage only from the final chunk \(where choices=\[\]\), and ensure your cost-tracking middleware does not aggregate usage objects across chunks.
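The difference between the two accounting patterns can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular proxy: both helper functions are hypothetical, the chunks are plain dicts standing in for parsed stream events, and the token counts are made up. It assumes a proxy that attaches a usage object to every chunk it re-emits.

```python
from typing import Iterable, Optional


def final_usage(chunks: Iterable[dict]) -> Optional[dict]:
    """Return the last non-null usage object seen in the stream.

    Usage is overwritten, never accumulated: a non-null usage object
    already describes the whole request.
    """
    usage = None
    for chunk in chunks:
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return usage


def summed_usage(chunks: Iterable[dict]) -> dict:
    """The buggy pattern: add up every usage object in the stream."""
    total = {"prompt_tokens": 0, "completion_tokens": 0}
    for chunk in chunks:
        usage = chunk.get("usage") or {}
        for key in total:
            total[key] += usage.get(key, 0)
    return total


# Hypothetical stream from a proxy that attaches usage to every chunk.
# The last chunk has an empty choices list, as in the final usage chunk.
proxy_stream = [
    {"choices": [{"delta": {"content": "Hel"}}],
     "usage": {"prompt_tokens": 100, "completion_tokens": 1}},
    {"choices": [{"delta": {"content": "lo"}}],
     "usage": {"prompt_tokens": 100, "completion_tokens": 2}},
    {"choices": [],
     "usage": {"prompt_tokens": 100, "completion_tokens": 2}},
]

print(final_usage(proxy_stream))   # {'prompt_tokens': 100, 'completion_tokens': 2}
print(summed_usage(proxy_stream))  # {'prompt_tokens': 300, 'completion_tokens': 5}
```

In this made-up stream the summing pattern reports three times the real prompt tokens, which is the kind of inflation that can make streaming look more expensive than batch in internal dashboards. Taking the last non-null usage object also works unchanged on a stream where only the final chunk carries usage.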
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:21:50.751275+00:00— report_created — created