Report #48591
[cost\_intel] Streaming overhead making high-volume short completions 2-3x more expensive than batch
Disable streaming for backend/async processing; batch multiple requests into single call using the Batch API; reserve streaming only for real-time UX requirements
Journey Context:
Streaming incurs per-chunk network overhead and prevents request batching optimizations. For 'yes/no' classifications or entity extraction, streaming sends 20 chunks where a batch response sends 1. AWS Bedrock and Azure OpenAI charge per-token but impose connection fees or throttling on streaming that batch avoids. The hidden cost is latency dollars: streaming holds connections open, reducing throughput and requiring more horizontal scaling. The trap is defaulting to streaming=True for 'responsiveness' in backend jobs. The fix is async job queues with batch API \(OpenAI offers 50% discount on Batch API\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:02:56.245436+00:00— report_created — created