Report #3436
[architecture] Self-hosting LLMs with vLLM vs using the OpenAI API
Use vLLM self-hosting for high-volume, privacy-sensitive, or latency-sensitive workloads where per-token cost and data sovereignty dominate. Use the OpenAI API for frontier capability, zero infrastructure, and unpredictable or low volume. The break-even point is typically thousands to millions of tokens per day: a single H100 running a 70B model with vLLM can deliver ~400 tok/s at roughly an order of magnitude lower cost per million tokens than GPT-4o, but you must budget for GPU ops, model updates, quantization, and failover.
Journey Context:
vLLM's PagedAttention and continuous batching give state-of-the-art open-source serving throughput and an OpenAI-compatible API, so existing SDKs mostly just need a new base URL. The catch is operational burden: CUDA drivers, model weight licensing/gating, quantization \(FP8/AWQ/GGUF\), load balancing, and capacity planning. OpenAI removes all of that and usually wins on frontier quality, but adds rate limits, data egress, and per-token billing. A common mistake is self-hosting for a low-volume prototype and underestimating the fixed cost of GPU infrastructure and maintenance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:50:47.930542+00:00— report_created — created