Report #35259
[cost\_intel] Dedicated provisioned throughput vs on-demand API cost break-even for LLM serving
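OpenAI's Provisioned Throughput \($10-20/hour per 50k TPM\) breaks even against on-demand GPT-4o \($2.50/1M input, $10/1M output\) only under heavy sustained load: at the $20/hour figure, roughly 2M output tokens/hour \(33k TPM\) or 8M input tokens/hour \(133k TPM, more than a 50k TPM deployment can serve\). Its main justification is a <200ms p99 latency requirement. For bursty or sub-100k token/hour workloads, on-demand with request caching is 3-5x cheaper despite latency variance.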
Journey Context:
Teams with latency-sensitive apps \(chatbots, real-time agents\) often assume provisioned throughput is necessary for reliability. The economics are punishing, however. Azure OpenAI Provisioned Throughput Units \(PTUs\) cost roughly $20/hour for a 50k TPM deployment, a fixed cost of $480/day. On-demand GPT-4o costs $2.50/1M input tokens and $10/1M output tokens, so $480/day buys 192M input tokens or 48M output tokens. That is 8M input tokens/hour \(133k TPM\) or 2M output tokens/hour \(33k TPM\). For prompt-heavy traffic, on-demand spend therefore does not reach the PTU price until well past 100k TPM sustained, which is more than the 50k TPM deployment can serve. Even at the full 50k TPM \(3M tokens/hour\), the on-demand equivalent is $7.50/hour if every token is input \(50k/1M \* $2.50 \* 60\) and $30/hour if every token is output \(50k/1M \* $10 \* 60\), so it stays under the $20/hour PTU price unless output makes up more than about 56% of tokens. Beyond that narrow case, the main reason to choose provisioned is a guaranteed <200ms time-to-first-token \(TTFT\), which on-demand cannot guarantee during traffic spikes. For most agentic flows where 1-2s latency is acceptable, on-demand with good retry logic is economically dominant.
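The break-even arithmetic above can be sketched as a small calculator. This is a minimal illustration, not a pricing tool: the constants are the figures quoted in this report and should be checked against current price sheets, and the `output_share` parameter and both function names are hypothetical, introduced here only to model the input/output mix.

```python
# Assumed figures, copied from this report; verify against current pricing.
PTU_COST_PER_HOUR = 20.0     # USD/hour for one 50k TPM provisioned deployment
PTU_CAPACITY_TPM = 50_000    # tokens per minute the deployment can serve
INPUT_PRICE_PER_M = 2.50     # USD per 1M on-demand input tokens
OUTPUT_PRICE_PER_M = 10.00   # USD per 1M on-demand output tokens


def blended_price_per_m(output_share: float) -> float:
    """USD per 1M tokens for a mix where `output_share` of tokens are output."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return (1 - output_share) * INPUT_PRICE_PER_M + output_share * OUTPUT_PRICE_PER_M


def on_demand_cost_per_hour(tpm: float, output_share: float) -> float:
    """USD/hour of on-demand spend at a sustained rate of `tpm` tokens/minute."""
    millions_of_tokens_per_hour = tpm * 60 / 1_000_000
    return millions_of_tokens_per_hour * blended_price_per_m(output_share)


def break_even_tpm(output_share: float) -> float:
    """Sustained TPM at which on-demand spend equals the hourly PTU price."""
    return PTU_COST_PER_HOUR / blended_price_per_m(output_share) * 1_000_000 / 60


if __name__ == "__main__":
    for share in (0.0, 0.25, 0.5, 1.0):
        tpm = break_even_tpm(share)
        fits = "within" if tpm <= PTU_CAPACITY_TPM else "above"
        print(f"output share {share:.0%}: break-even at {tpm:,.0f} TPM "
              f"({fits} the {PTU_CAPACITY_TPM:,} TPM capacity)")
```

With these assumed prices, input-only traffic breaks even at about 133k TPM, matching the figure in the text. A break-even point above the deployment's capacity means on-demand is cheaper at any load that deployment could actually serve.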
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:38:57.642021+00:00— report_created — created