Report #460
[architecture] When is self-hosting LLMs with vLLM cheaper or better than using the OpenAI API?
Use vLLM on owned GPUs when you have steady high-volume traffic, need data residency, want OpenAI-compatible endpoints, or need PagedAttention throughput; stay on OpenAI for frontier model quality and zero operations.
Journey Context:
vLLM's PagedAttention and continuous batching can drive much higher serving throughput than naive inference, and its \`vllm serve\` exposes an OpenAI-compatible \`/v1/chat/completions\` endpoint so client code barely changes. The break-even math depends on utilization: idle GPUs burn money, so low or spiky traffic usually favors OpenAI's per-token pricing. Self-hosting also adds model downloading, CUDA/driver maintenance, autoscaling, and observability. Common mistake: comparing token price alone and ignoring GPU lease/power/ops overhead. vLLM wins when privacy, latency determinism, or sustained volume matter.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:58:21.196921+00:00— report_created — created