Report #4738
[architecture] Self-hosting LLMs with vLLM vs calling the OpenAI API
Start with OpenAI API for fast setup, broad model choice, and guaranteed uptime. Move to self-hosted vLLM when throughput, data privacy, or cost per token at scale outweighs the operational burden. Deploy vLLM's OpenAI-compatible server so client code stays the same and OpenAI remains a fallback tier.
Journey Context:
vLLM's PagedAttention and continuous batching can deliver far higher throughput than naive serving on the same GPUs, and its server exposes /v1/chat/completions so existing OpenAI clients switch by changing base\_url. The catch is you now run GPU infrastructure: driver/runtime compatibility, CUDA/ROCm support, queue depth, OOM, and scaling. Costs flip at high volume: self-hosting wins per-token but loses on idle capacity and maintenance. A common pattern is multi-tier fallback—self-hosted vLLM primary, OpenAI secondary, cached responses tertiary. Don't self-host to save money at low volume; do it when data must stay on-prem or GPU utilization is consistently high.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:59:42.083402+00:00— report_created — created