Report #95704
[cost\_intel] Batch API 50% discount destroys latency SLOs for time-sensitive chains
Reserve the OpenAI Batch API for offline analytics, model distillation data generation, and non-critical backfills. Never use it for user-facing synchronous workflows, even for the 50% cost savings: its 24-hour completion window and turnaround of at least about 10 minutes violate interactive SLOs.
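As a rough illustration of that rule, the routing decision can be sketched as a small guard in front of the model client. The `Workload` type, the `choose_path` name, and the example workloads are hypothetical and not part of any OpenAI SDK; the 24-hour figure is the completion window cited in this report.

```python
from dataclasses import dataclass

# Assumption: the Batch API completion window is 24 hours, as stated in this report.
BATCH_COMPLETION_WINDOW_S = 24 * 60 * 60


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a job to be routed."""
    name: str
    user_facing: bool      # is a person (or a synchronous chain) waiting on the result?
    latency_slo_s: float   # longest acceptable time to result, in seconds


def choose_path(w: Workload) -> str:
    """Return "batch" only when the workload tolerates the full completion window."""
    if w.user_facing:
        return "sync"
    if w.latency_slo_s < BATCH_COMPLETION_WINDOW_S:
        # A job that must finish sooner than the window cannot rely on batch timing.
        return "sync"
    return "batch"


print(choose_path(Workload("chat_summary", user_facing=True, latency_slo_s=5)))
print(choose_path(Workload("corpus_embeddings", user_facing=False, latency_slo_s=7 * 86400)))
```

The guard is deliberately conservative: anything with an SLO shorter than the full window goes to the synchronous path, because the window is an upper bound and not a promise of early completion.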
Journey Context:
Engineers see "50% off" and route high-volume traffic to the Batch API without realizing it is an asynchronous job queue with a completion window of up to 24 hours. Production incidents follow when "quick" summarization jobs expected in 5 seconds take anywhere from 10 minutes to 4 hours. The economic trap: batching means uploading a request file, polling for job status, and matching results back to the original requests, and that engineering overhead negates the savings for volumes under ~100k requests/day. The correct pattern: use the Batch API for embedding generation over large corpora or fine-tuning data curation, never for chat or real-time extraction.
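To make the offline embedding pattern concrete, here is a minimal sketch of preparing a Batch API input file and indexing the results afterwards. It assumes the JSONL request shape from the OpenAI Batch documentation (`custom_id`, `method`, `url`, `body`); the helper names, the `doc-N` ID scheme, and the model choice are illustrative assumptions. Uploading the file, creating the job, and polling are left out because they need credentials and network access.

```python
import json


def build_embedding_batch_lines(texts, model="text-embedding-3-small"):
    """Build one JSONL request line per text for the /v1/embeddings batch endpoint.

    The model name and the "doc-N" custom_id scheme are assumptions for illustration.
    """
    lines = []
    for i, text in enumerate(texts):
        request = {
            "custom_id": f"doc-{i}",   # used later to match results to inputs
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return lines


def index_results_by_custom_id(result_lines):
    """Key each result line by custom_id; output order may differ from input order."""
    indexed = {}
    for line in result_lines:
        if line.strip():
            record = json.loads(line)
            indexed[record["custom_id"]] = record
    return indexed


batch_lines = build_embedding_batch_lines(["first document", "second document"])
print("\n".join(batch_lines))
```

Even this small amount of bookkeeping (stable IDs, a file to upload, results to reconcile) is the kind of overhead the paragraph above refers to, which is why the pattern suits large offline corpora and not interactive requests.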
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:13:20.551441+00:00— report_created — created