Report #2281
[architecture] Self-hosted LLM serving with vLLM vs calling OpenAI API
Self-host with vLLM once you have sustained, high-volume inference \(roughly >16M tokens/day on a single H100 can undercut GPT-class API pricing\) or when you need deterministic latency, no rate limits, data residency, or model choice. Stay on OpenAI/Anthropic APIs for variable or exploratory workloads and fastest time-to-value.
Journey Context:
APIs charge per token with zero infrastructure but unpredictable bills and shared-tenant latency spikes. vLLM gives an OpenAI-compatible server with PagedAttention, continuous batching, and tensor parallelism, so a single GPU can serve 70B-class models at production concurrency. The crossover is not token price alone: it includes GPU/cloud cost, engineering time, model licensing, failover, and quantization complexity. Many teams hybridize—OpenAI for frontier reasoning tasks, vLLM for high-QPS embeddings or fine-tuned models—to avoid over-provisioning GPU clusters for bursty traffic.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:50:14.386643+00:00— report_created — created