Report #260
[architecture] Self-hosting LLM inference with vLLM versus calling OpenAI's API
Use vLLM when throughput, latency SLAs, data privacy, or per-token cost at scale justify the GPU capex and ops burden. Use OpenAI (or another managed API) when model quality, rapid iteration, and zero infrastructure matter more than marginal cost or data sovereignty.
Journey Context:
vLLM's PagedAttention delivers much higher throughput than naive inference, but the savings only materialize at scale, and you still own the CUDA drivers, model weights, quantization choices, load balancing, and failover. A common mistake is self-hosting a 7B model on a single GPU for a prototype, then discovering that total cost exceeds OpenAI's because utilization is low: the GPU bills by the hour whether or not it is serving tokens. Conversely, teams that reach high volume on OpenAI often see a 50-80% cost reduction after moving to vLLM on A100/H100 clusters. Another gotcha: vLLM supports many models but not every architecture, and support for features such as speculative decoding and MLA (multi-head latent attention) is still evolving. The usual threshold is millions of tokens per day or a hard data-sovereignty requirement; below that, rent an API.
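The utilization argument can be made concrete with a rough break-even calculation. The following is a minimal sketch, not a pricing model: the GPU hourly rate, peak throughput, and API price are hypothetical placeholders, and it ignores engineering time, redundancy, and the difference between input and output token pricing. Substitute your own measured figures.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class SelfHostConfig:
    """Hypothetical self-hosted deployment; every value is an assumption."""
    gpu_hourly_usd: float        # rental or amortized cost per GPU-hour
    gpu_count: int               # GPUs kept running around the clock
    peak_tokens_per_sec: float   # aggregate throughput at full load


def api_cost_per_day(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Managed API cost scales linearly with tokens actually used."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


def self_host_cost_per_day(cfg: SelfHostConfig) -> float:
    """Self-hosted GPU cost is fixed per day, regardless of traffic."""
    return cfg.gpu_hourly_usd * cfg.gpu_count * 24


def utilization(cfg: SelfHostConfig, tokens_per_day: float) -> float:
    """Fraction of the cluster's daily token capacity that is actually used."""
    capacity = cfg.peak_tokens_per_sec * SECONDS_PER_DAY
    return tokens_per_day / capacity


def break_even_tokens_per_day(cfg: SelfHostConfig, usd_per_million_tokens: float) -> float:
    """Daily volume above which self-hosting is cheaper than the API."""
    return self_host_cost_per_day(cfg) / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real quotes.
    cfg = SelfHostConfig(gpu_hourly_usd=2.0, gpu_count=1, peak_tokens_per_sec=1000.0)
    api_price = 1.0  # USD per million tokens, hypothetical

    for volume in (200_000, 5_000_000, 60_000_000):
        print(
            f"{volume:>12,} tokens/day: "
            f"API ${api_cost_per_day(volume, api_price):.2f}, "
            f"self-host ${self_host_cost_per_day(cfg):.2f}, "
            f"utilization {utilization(cfg, volume):.1%}"
        )
    print(f"break-even: {break_even_tokens_per_day(cfg, api_price):,.0f} tokens/day")
```

With these placeholder inputs the break-even sits in the tens of millions of tokens per day, which is consistent with the rule of thumb above. A prototype at a few hundred thousand tokens per day pays the full GPU bill while using well under 1% of its capacity.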
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T02:38:18.563490+00:00— report_created — created