Report #49050
[cost_intel] How do the OpenAI and Gemini batch APIs reduce costs for high-volume async tasks?
Use the batch API only when latency tolerance is over 1 hour and request volume exceeds 10k/day. Batching gives a 50% discount on input and output tokens, but results are only promised within a 24-hour completion window. For content moderation, embedding generation, and document classification with no real-time requirement, batching halves the bill: a workload priced at $3 per 1M tokens costs $1.50 per 1M tokens when batched. Do NOT batch interactive, user-facing features.
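The rule of thumb and the discount arithmetic can be sketched as a small helper. This is a minimal illustration, not a provider API: the function names `should_batch` and `daily_cost` are hypothetical, the thresholds and the 50% discount are the figures quoted in this report, and the $3 per 1M tokens price is only the example rate used above.

```python
# Figures taken from this report; adjust to your provider's current pricing.
BATCH_DISCOUNT = 0.50            # 50% off input/output tokens when batched
MIN_LATENCY_TOLERANCE_S = 3600   # latency tolerance must exceed 1 hour
MIN_REQUESTS_PER_DAY = 10_000    # volume must exceed 10k requests/day


def should_batch(latency_tolerance_s: float, requests_per_day: int) -> bool:
    """Return True only when both thresholds from the report are exceeded."""
    return (
        latency_tolerance_s > MIN_LATENCY_TOLERANCE_S
        and requests_per_day > MIN_REQUESTS_PER_DAY
    )


def daily_cost(tokens_per_day: int, price_per_million: float, batched: bool) -> float:
    """Daily spend in dollars, applying the batch discount when batched."""
    cost = tokens_per_day / 1_000_000 * price_per_million
    return cost * (1 - BATCH_DISCOUNT) if batched else cost


if __name__ == "__main__":
    tokens = 200_000_000  # hypothetical nightly indexing workload
    print(should_batch(latency_tolerance_s=7200, requests_per_day=50_000))
    print(daily_cost(tokens, 3.0, batched=False))
    print(daily_cost(tokens, 3.0, batched=True))
```

With the hypothetical 200M tokens/day workload, the standard rate comes to $600/day and the batched rate to $300/day. A job with a 5-minute tolerance fails the `should_batch` check regardless of volume.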
Journey Context:
Teams often run high-volume embedding and classification jobs through the standard synchronous API, paying twice what they need to. OpenAI's Batch API (and Gemini's equivalent) trades latency for cost. The constraint is strict: jobs return within 24 hours (usually under 2 hours), with no streaming and no immediate response. That makes it a good fit for nightly catalog indexing, training data labeling, or compliance scanning. A common mistake is using batching for 'near real-time' work (5-minute tolerance): you pay 50% less but miss the deadline, and then need fallback logic to the standard API, which adds complexity. Commit to an async architecture or pay full price.
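For a nightly embedding job, the async path starts by writing requests to a JSONL file instead of sending them one by one. The sketch below assumes OpenAI's documented batch input format (one JSON object per line with `custom_id`, `method`, `url`, and `body`). The helper name `build_batch_lines`, the `doc-N` id scheme, and the model name are illustrative assumptions; check the current provider documentation before relying on any of them.

```python
import json
from typing import Iterable


def build_batch_lines(
    texts: Iterable[str],
    model: str = "text-embedding-3-small",  # assumed model name, swap as needed
) -> str:
    """Serialize texts into JSONL, one embedding request per line.

    custom_id lets you match each result back to its input, since batch
    output order is not assumed to follow input order.
    """
    lines = []
    for i, text in enumerate(texts):
        request = {
            "custom_id": f"doc-{i}",      # hypothetical id scheme
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)


if __name__ == "__main__":
    payload = build_batch_lines(["red running shoes", "blue rain jacket"])
    print(payload)
```

The resulting file would then be uploaded and submitted as a batch job with a 24-hour completion window. Those network calls are left out here because they need credentials and differ between providers.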
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:49:05.874706+00:00— report_created — created