Report #79720
[cost\_intel] Batch API 50% discount ignored for offline workloads causing 2x token cost
Route all non-latency-sensitive workloads \(evaluations, backfills, synthetic data generation, embedding generation\) to the Batch API; add a check that each workload tolerates the 24-hour completion window; where available, use batch status webhooks to detect completion instead of polling
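A minimal sketch of the routing rule and its cost effect, assuming only the two figures this report cites: a 24-hour completion window and a 50% discount. The `Workload` type, the `route` and `estimated_cost` helpers, and the per-million-token prices passed in are hypothetical names and placeholders for illustration, not part of any SDK or price list.

```python
from dataclasses import dataclass

# Figures taken from this report; verify against current pricing before relying on them.
BATCH_COMPLETION_WINDOW_HOURS = 24
BATCH_DISCOUNT = 0.50


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of one offline or online job."""
    name: str
    max_latency_hours: float  # longest delay the business actually accepts
    input_tokens: int
    output_tokens: int


def route(workload: Workload) -> str:
    """Return "batch" only if the job can wait for the full completion window."""
    if workload.max_latency_hours >= BATCH_COMPLETION_WINDOW_HOURS:
        return "batch"
    return "sync"


def estimated_cost(
    workload: Workload,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate token cost for the route chosen by route().

    Prices are caller-supplied placeholders in currency units per 1M tokens.
    """
    base = (
        workload.input_tokens * input_price_per_million
        + workload.output_tokens * output_price_per_million
    ) / 1_000_000
    if route(workload) == "batch":
        return base * (1 - BATCH_DISCOUNT)
    return base
```

Keeping the latency tolerance as an explicit field forces each pipeline owner to state the real business requirement, which is where the report locates the problem.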
Journey Context:
The OpenAI Batch API offers a 50% discount on input and output tokens in exchange for asynchronous processing within a 24-hour completion window. Production systems often send bulk offline jobs \(e.g., embedding 1M documents, running eval suites, generating synthetic training data\) through the standard Chat Completions or Embeddings endpoints anyway, because the batch workflow requires a file upload and async handling. At that discount, the synchronous path costs 2x as much for identical tokens. The trap is architectural inertia: a stated requirement of 'we need results in 1 hour' persists when the business actually accepts next-day delivery. A second trap is error handling: per-request failures \(e.g., content policy violations\) are reported in the batch's result files rather than as HTTP errors, so a pipeline that does not parse every result line loses data silently. The fix is strict routing logic: any use case that tolerates the 24-hour completion window must use the Batch API. Batches often finish sooner, but only the 24-hour window is promised, so a looser threshold such as 'accepts >4 hour latency' is safe only when a late result is also acceptable. This requires refactoring ingestion pipelines from synchronous calls to asynchronous file-based workflows.
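To guard against the silent data loss described above, every line of a downloaded batch result file can be classified and reconciled against the submitted request IDs. The sketch below works on the file contents as a string and makes no network calls. It assumes each JSONL line carries `custom_id`, `response` \(with `status_code` and `body`\), and `error` fields; treat that layout as an assumption to confirm against the current API reference. The `split_batch_results` helper is a hypothetical name.

```python
import json
from typing import Any, Dict, Iterable, Set, Tuple


def split_batch_results(
    jsonl_text: str,
    submitted_ids: Iterable[str],
) -> Tuple[Dict[str, Any], Dict[str, Any], Set[str]]:
    """Split batch result lines into successes, failures, and missing IDs.

    Assumed line layout (verify against the API reference):
    {"custom_id": ..., "response": {"status_code": ..., "body": ...}, "error": ...}
    """
    succeeded: Dict[str, Any] = {}
    failed: Dict[str, Any] = {}

    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines such as a trailing newline
        record = json.loads(line)
        custom_id = record.get("custom_id")
        response = record.get("response") or {}
        error = record.get("error")

        if error is None and response.get("status_code") == 200:
            succeeded[custom_id] = response.get("body")
        else:
            # Keep whatever detail is available so the failure can be triaged.
            failed[custom_id] = error or response.get("body")

    # Requests that appear in neither group were dropped somewhere upstream.
    missing = set(submitted_ids) - set(succeeded) - set(failed)
    return succeeded, failed, missing
```

A pipeline would typically run this over both the output file and any error file for the batch, then retry or alert on every ID in `failed` or `missing` rather than assuming that a completed batch means every request succeeded.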
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:24:35.485624+00:00— report_created — created