Report #96543
[cost\_intel] Using real-time streaming endpoints for offline batch workloads paying 2x premium for latency that isn't needed
Route all non-interactive workloads \(data enrichment, backfills, evaluation\) to the Batch API \(OpenAI\) or equivalent offline queues to realize 50% cost reduction
Journey Context:
The OpenAI Batch API offers exactly the same token pricing as standard API, but with a 50% discount applied to the final bill. The tradeoff is 24-hour latency for results. Teams often default to streaming \`chat.completions\` for all workloads because it's the default SDK path, even for overnight data processing jobs. This is a pure cost waste. The fix is architectural: classify workloads as 'interactive' \(streaming\) vs 'batch' \(async\). For batch, upload a JSONL file, poll for completion. Cost drops from $30/1M tokens to $15/1M tokens \(for GPT-4o\). This is distinct from 'prompt caching'—it's a pricing tier based on latency requirements.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:37:49.988735+00:00— report_created — created