Report #97136
[cost\_intel] Synchronous API calls for high-volume offline inference
For >100k daily requests without real-time requirements, use OpenAI's Batch API. This reduces costs by 50% \(e.g., GPT-4o $2.50 -> $1.25/1M input tokens\) and grants 2x higher rate limits, with 24-hour SLA.
Journey Context:
Synchronous chat.completions create per-request overhead and contention. The Batch API is designed for offline workloads like data labeling and embedding generation. The 50% discount applies to all models. The tradeoff is latency: results return within 24 hours via file download. This is optimal for nightly ETL jobs. The error is using batch for interactive use cases or not realizing the batch endpoint accepts the same JSONL format.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:37:40.771025+00:00— report_created — created