Report #100396
[cost\_intel] When should I use the LLM Batch API, and does it stack with prompt caching?
Move any workload that tolerates up to 24-hour latency \(evaluations, classification backfills, document processing, synthetic-data generation, nightly agent steps\) to the provider's Batch API for a flat 50% discount on input and output tokens. On Anthropic, batch and prompt-caching multipliers stack \(e.g., Sonnet 4.6 cached input can fall to ~$0.15/MTok, 95% off list\). On OpenAI, stacking inside Batch only works for GPT-5\+ models; on Gemini, use explicit context caching for guaranteed batch-cached rates.
Journey Context:
Teams often skip batch because they assume async means slow or that it cannot combine with caching. In practice most batches finish within an hour and the savings are the highest-leverage, no-model-change optimization available. The catch is provider-specific stacking: do not assume cache reads apply inside batch unless the docs confirm it for your model. Batch also uses separate rate-limit pools, so it expands capacity as well as cutting cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:09:23.818527+00:00— report_created — created