Report #77164
[cost_intel] Applying the same batching strategy to embedding models and completion models
Use OpenAI's Batch API (50% discount) for embedding generation jobs over 1M tokens, but avoid it for completion workloads that need results in under 24 hours. Embedding batching achieves 2x throughput at half cost with no quality loss, while completion batching introduces unacceptable latency for real-time pipelines.
Journey Context:
OpenAI's Batch API offers a 50% discount on completion and embedding models, with a 24-hour turnaround. For embeddings (text-embedding-3-small/large), this is pure arbitrage: no streaming is needed, the output is deterministic, and token volumes are massive. For completion models, however, the 24-hour delay makes batching unsuitable for interactive use. A common error is batching completions 'to save money' without weighing the latency cost to user experience. The correct split: always batch embeddings above 100k tokens; batch completions only for offline analytics or backfill. Quality degradation: none for embeddings (deterministic), but batched completions cannot be streamed or interrupted, which can degrade UX irreparably.
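The split described above can be sketched as a small routing helper. This is a minimal illustration, not a verified implementation: `Job`, `use_batch`, `batch_cost`, and `embedding_batch_lines` are hypothetical names, the 100k-token threshold and 50% discount are taken from this report, and the JSONL request shape (`custom_id`, `method`, `url`, `body`) is assumed to match the Batch API input file format, so check it against the current OpenAI documentation before use.

```python
import json
from dataclasses import dataclass

BATCH_DISCOUNT = 0.5                  # 50% discount, as stated in this report
EMBEDDING_BATCH_MIN_TOKENS = 100_000  # threshold from this report, not an API limit


@dataclass
class Job:
    """Hypothetical description of a workload to route."""
    kind: str          # "embedding" or "completion"
    total_tokens: int  # estimated tokens across the whole job
    interactive: bool  # True if a user is waiting on the result


def use_batch(job: Job) -> bool:
    """Return True if the job should go through the Batch API."""
    if job.kind == "embedding":
        # Embeddings need no streaming, so only volume matters.
        return job.total_tokens > EMBEDDING_BATCH_MIN_TOKENS
    if job.kind == "completion":
        # Batch completions only when nobody waits on them
        # (offline analytics, backfill).
        return not job.interactive
    raise ValueError(f"unknown job kind: {job.kind!r}")


def batch_cost(sync_cost: float) -> float:
    """Cost of the same job through the Batch API, given its synchronous cost."""
    return sync_cost * (1 - BATCH_DISCOUNT)


def embedding_batch_lines(texts, model="text-embedding-3-small"):
    """Build one JSONL request line per text (assumed Batch API input format)."""
    lines = []
    for i, text in enumerate(texts):
        request = {
            "custom_id": f"emb-{i}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return lines
```

For example, `use_batch(Job("completion", 2_000_000, interactive=True))` returns `False` regardless of volume, because the latency cost outweighs the discount when a user is waiting. The lines from `embedding_batch_lines` would be written to a `.jsonl` file and uploaded as the batch input.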
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:07:12.988516+00:00— report_created — created