Report #82833
[cost_intel] Not using the Batch API for offline tasks doubles token costs unnecessarily
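Migrate all non-real-time workloads (embeddings, classification, backfills, evaluation) to the OpenAI Batch API for a 50% cost reduction; give every request a deterministic custom_id and, on retry, resubmit only the requests that have no successful result, so completed work is not billed twice; accept the 24-hour completion window; check batch status at a coarse interval and persist results from the batch's output and error files rather than polling individual requests; avoid the Batch API for latency-sensitive operations (results needed in under 5 minutes).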
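The migration and retry steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the report author's implementation: the helper names (`make_custom_id`, `build_batch_lines`, `completed_ids`, `submit_batch`, `fetch_output_if_done`) and the hash-based custom_id scheme are hypothetical. The input line shape (`custom_id`, `method`, `url`, `body`), the output field `response.status_code`, and the `openai` v1 SDK calls (`files.create`, `batches.create`, `batches.retrieve`, `files.content`) are assumed to match the current OpenAI docs and should be checked there before use.

```python
import hashlib
import json
from collections.abc import Iterable, Set
from typing import Optional


def make_custom_id(task: str, body: dict) -> str:
    """Stable custom_id: the same logical request always maps to the same id.

    Hypothetical scheme: task name plus a short hash of the canonical JSON body.
    """
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{task}-{digest}"


def build_batch_lines(
    task: str,
    endpoint: str,
    bodies: Iterable[dict],
    skip_ids: Set[str] = frozenset(),
) -> list[str]:
    """Build JSONL lines for a Batch API input file.

    Requests whose custom_id is in skip_ids (already completed) are left out,
    so a retry resubmits only unfinished work. Duplicates are dropped because
    custom_id must be unique within one batch.
    """
    lines: list[str] = []
    seen: set[str] = set()
    for body in bodies:
        cid = make_custom_id(task, body)
        if cid in skip_ids or cid in seen:
            continue
        seen.add(cid)
        lines.append(json.dumps(
            {"custom_id": cid, "method": "POST", "url": endpoint, "body": body}
        ))
    return lines


def completed_ids(output_jsonl: str) -> set[str]:
    """Return custom_ids that have a successful (HTTP 200) result in an output file."""
    done: set[str] = set()
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}  # failed requests may have response = null
        if response.get("status_code") == 200:
            done.add(record["custom_id"])
    return done


def submit_batch(jsonl_path: str, endpoint: str) -> str:
    """Upload the input file and create a batch; returns the batch id."""
    from openai import OpenAI  # deferred so the helpers above work without the SDK

    client = OpenAI()
    with open(jsonl_path, "rb") as f:
        input_file = client.files.create(file=f, purpose="batch")
    batch = client.batches.create(
        input_file_id=input_file.id,
        endpoint=endpoint,
        completion_window="24h",
    )
    return batch.id


def fetch_output_if_done(batch_id: str) -> Optional[str]:
    """Check a batch once; return its output file text if it has finished."""
    from openai import OpenAI

    client = OpenAI()
    batch = client.batches.retrieve(batch_id)
    # Expired or cancelled batches can still carry results for completed requests.
    if batch.status in ("completed", "expired", "cancelled") and batch.output_file_id:
        return client.files.content(batch.output_file_id).text
    return None
```

On retry, pass the previous output text through `completed_ids` and hand the result to `build_batch_lines` as `skip_ids`, so the new input file contains only the requests that never succeeded. Saving each batch's output to its own file keeps those results on disk instead of re-fetching them.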
Journey Context:
Engineers default to the standard Chat Completions API for all workloads because it's synchronous and familiar. However, for back-office tasks like tagging historical data or generating embeddings for a vector database, the Batch API's 24-hour completion window is acceptable and the 50% discount is substantial (e.g., GPT-4o input pricing drops from $5.00 to $2.50 per 1M tokens). The trap is not knowing the Batch API exists, or assuming it's only for fine-tuning. Additionally, without idempotency, retrying an expired or partially failed batch by resubmitting the entire input file re-bills requests that already completed; tracking which custom_id values already have results lets a retry skip them.
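The pricing example above is simple arithmetic, and a minimal sketch can also encode the routing rule from the summary: use the Batch API only when a workload can wait out the 24-hour window, and stay synchronous otherwise. `Workload`, `monthly_cost`, and the token volumes are hypothetical; the $5.00 rate is the figure quoted in this report, and current prices should be taken from OpenAI's pricing page.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.5       # 50% off, as described in this report
BATCH_WINDOW_HOURS = 24    # Batch API completion window


@dataclass
class Workload:
    name: str
    tokens_per_month: int
    max_latency_hours: float  # how long callers can wait for results


def monthly_cost(w: Workload, price_per_million: float) -> tuple[float, bool]:
    """Return (cost in USD, uses_batch) for one workload.

    Only workloads that tolerate the full 24-hour window are routed to batch;
    anything tighter (e.g. results needed within 5 minutes) stays synchronous.
    """
    use_batch = w.max_latency_hours >= BATCH_WINDOW_HOURS
    rate = price_per_million * (BATCH_DISCOUNT if use_batch else 1.0)
    return w.tokens_per_month / 1_000_000 * rate, use_batch
```

For example, at $5.00 per 1M tokens a 200M-token monthly backfill that can wait 72 hours costs $500.00 instead of $1,000.00, while a feature that needs answers within 3 minutes (0.05 hours) stays on the synchronous API at full price.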
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:37:33.650079+00:00 - report_created - created