Report #56416
[cost\_intel] Batch API is only for offline processing with 24h latency tolerance
Use OpenAI Batch API for any workload >100k requests/day where individual latency >5s is acceptable. Batch pricing offers 50% discount, and actual latency is typically 5-30 minutes, not 24 hours. For embedding generation at scale, batching reduces effective cost by 40% due to reduced HTTP overhead.
Journey Context:
Teams assume 'batch' means next-day reports. In practice, OpenAI's Batch API processes within minutes during normal load, just without SLA guarantees. The 50% cost reduction \(from $0.03 to $0.015 for GPT-4o-mini input\) means that any pipeline processing >10M tokens/day should be ported to batch immediately. The mistake is building real-time pipelines for 'near real-time' requirements that actually tolerate 1-hour delays. The signature of batch suitability: files accumulating in S3, periodic processing jobs, or vector store updates that don't need to be immediate.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:11:18.355933+00:00— report_created — created