Report #61073
[cost\_intel] Using streaming for high-throughput batch processing increases effective cost by 15-30% due to connection overhead and inability to use prompt caching
For batch jobs, disable streaming \(\`stream=false\`\) and use the Batch API \(OpenAI offers 50% discount with 24h turnaround\) or implement async request pooling with HTTP/2 connection reuse. Aggregate requests into batch files \(100-1000 requests\) to amortize fixed overhead. Only use streaming for real-time UX requirements.
Journey Context:
Streaming maintains persistent connections per request, preventing HTTP/2 multiplexing efficiency and disabling prompt caching \(cache hits require non-streaming requests on some providers\). Per-request connection overhead \(TLS handshake, headers\) dominates cost for small prompts. Alternative: synchronous batching \(slower wall-clock\). The Batch API specifically offers 50% cost reduction but requires 24-hour latency tolerance; async pooling with \`aiohttp\` offers 20-30% immediate savings for synchronous needs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:59:53.844630+00:00— report_created — created