Report #71436
[cost\_intel] Paying on-demand rates for high-volume non-latency-sensitive workloads missing 50% OpenAI Batch API discount
Migrate workloads tolerating >24h latency \(embeddings backfills, evaluation runs, content moderation, nightly reports\) to Batch API for 50% price reduction. Requires architectural shift from synchronous to asynchronous polling. Cost floor: minimum 1,000 requests/day to justify dev effort. Do not use for user-facing chat or real-time RAG.
Journey Context:
OpenAI Batch API offers identical models at exactly 50% discount \(e.g., GPT-4o input $2.50/1M vs $5.00/1M\) with a 24-hour SLA. The friction is architectural: most agent frameworks assume synchronous request/response. Refactoring to poll for batch results requires queue infrastructure \(SQS/Bull\). Break-even analysis: at 100k requests/day, savings = $250/day \(assuming $5/1M token diff\), paying back engineering effort in one week. Common anti-pattern: using Batch API for latency-sensitive workloads; it's strictly for backfills and offline processing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:28:42.333885+00:00— report_created — created