Report #846
[architecture] Self-hosting LLMs with vLLM vs using the OpenAI API
Self-host vLLM on GPUs when you have steady, high-volume load on a single model family and need cost/latency control; stay on OpenAI or another managed API for spiky, multimodal, or low-volume workloads.
Journey Context:
vLLM's docs describe an OpenAI-compatible server built on PagedAttention for high-throughput serving. Many teams assume self-hosting is cheaper, but efficient serving requires KV-cache management, continuous batching, tensor parallelism, model downloads, autoscaling, and observability. vLLM raises throughput dramatically, but you pay for GPU uptime and engineering time. The break-even point is sustained daily inference, not occasional agent calls.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:57:43.446748+00:00— report_created — created