Report #351
[architecture] When does self-hosting LLMs with vLLM beat using the OpenAI API?
Use vLLM when you have steady, high-volume inference, need to tune latency and throughput beyond what a hosted API exposes, or must keep prompts and responses on your own GPUs. Start with vLLM's OpenAI-compatible server and quantize (AWQ/GPTQ/FP8) to fit models on cheaper GPUs. Do not self-host if your workload is bursty or low-volume: the cost of idle GPUs will likely exceed OpenAI's pay-per-use pricing.
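As a rough illustration of the break-even point behind this advice, the sketch below compares a fixed hourly GPU cost with per-token API pricing. The prices, traffic figures, and function names are hypothetical placeholders, not quoted rates. The model also ignores operational overhead and assumes the GPU can actually sustain the required throughput.

```python
def breakeven_tokens_per_hour(gpu_cost_per_hour: float, api_price_per_million: float) -> float:
    """Tokens per hour at which API spend equals the hourly GPU cost."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return gpu_cost_per_hour / api_price_per_million * 1_000_000


def hourly_costs(tokens_per_hour: float, gpu_cost_per_hour: float,
                 api_price_per_million: float) -> dict:
    """Hourly cost of each option; the GPU is billed whether or not it is busy."""
    api_cost = tokens_per_hour / 1_000_000 * api_price_per_million
    return {"api": api_cost, "self_hosted": gpu_cost_per_hour}


def cheaper_option(tokens_per_hour: float, gpu_cost_per_hour: float,
                   api_price_per_million: float) -> str:
    """Name of the cheaper option at a given sustained token rate."""
    costs = hourly_costs(tokens_per_hour, gpu_cost_per_hour, api_price_per_million)
    return min(costs, key=costs.get)


# Hypothetical numbers for illustration only (not real price quotes).
GPU_COST_PER_HOUR = 2.00       # USD per hour for one rented GPU
API_PRICE_PER_MILLION = 4.00   # USD per million tokens, blended input/output

threshold = breakeven_tokens_per_hour(GPU_COST_PER_HOUR, API_PRICE_PER_MILLION)
print(f"Break-even: {threshold:,.0f} tokens/hour")

for rate in (100_000, 2_000_000):  # a low-volume and a high-volume workload
    choice = cheaper_option(rate, GPU_COST_PER_HOUR, API_PRICE_PER_MILLION)
    print(f"{rate:>9,} tokens/hour -> cheaper: {choice}")
```

For a bursty workload, feed in the average sustained rate over the billing period rather than the peak, since the GPU is paid for during idle hours too.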
Journey Context:
vLLM's PagedAttention manages the KV cache in fixed-size blocks, much like pages in virtual memory, which keeps fragmentation low and KV-cache utilization close to 100%. Combined with continuous batching, this is commonly reported to outperform naive Hugging Face serving by 2-4x on batch workloads. vLLM also supports tensor and pipeline parallelism for large models and prefix caching for repeated prompts. The trade-off is operational: CUDA drivers, model serving, autoscaling, monitoring, and security are now yours. The OpenAI API removes the infrastructure burden but charges per token and sends prompts and responses to a third party. Many teams run vLLM for base/open models and still call OpenAI for frontier capability. Monitor prefill latency (time to first token) separately from decode latency (time per output token), and enable chunked prefill where long prompts would otherwise stall decoding.
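To make the virtual-memory analogy concrete, here is a toy block allocator in the spirit of a paged KV cache. It is a simplified sketch, not vLLM's implementation: the class name, method names, and block size are hypothetical, and it tracks only block bookkeeping, not tensors.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class PagedKVCache:
    """Toy paged allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical block ids
        self.lengths: dict[str, int] = {}             # seq id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve a slot for one more token; return False if no block is free."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.block_tables[seq_id] = table
        self.lengths[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for other requests."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def utilization(self) -> float:
        """Fraction of allocated slots that actually hold tokens."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return sum(self.lengths.values()) / allocated if allocated else 0.0


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("request-a")   # 20 tokens span two blocks
for _ in range(5):
    cache.append_token("request-b")   # 5 tokens fit in one block

print("blocks per sequence:", {k: len(v) for k, v in cache.block_tables.items()})
print(f"paged utilization: {cache.utilization():.2f}")

# Contrast: reserving a contiguous 2048-token region per sequence up front.
MAX_LEN = 2048
print(f"contiguous utilization: {25 / (2 * MAX_LEN):.4f}")

cache.free("request-a")  # freed blocks are immediately reusable by new requests
print("free blocks after release:", len(cache.free_blocks))
```

Waste is bounded by one partially filled block per sequence, which is why small blocks keep utilization high and why finished requests can hand memory to newly admitted ones mid-batch.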
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T05:40:20.253731+00:00 - report_created - created