Report #2393
[architecture] Self-hosting LLMs with vLLM versus calling the OpenAI API
Use vLLM when you have steady GPU capacity and need high-throughput, low marginal cost, data-sovereign inference with an OpenAI-compatible endpoint. Use OpenAI's API when you need frontier models, fast iteration, and zero operational overhead.
Journey Context:
vLLM's PagedAttention and continuous batching dramatically increase throughput over naive Transformers serving, and its built-in server exposes /v1/chat/completions so existing OpenAI clients work with only a base\_url change. The catch is operational complexity: CUDA drivers, model caching, quantization, autoscaling, and observability become your problem. OpenAI handles all of that but charges per token and can change model availability/pricing. A frequent mistake is self-hosting too early: unless inference volume is large or data must stay on-prem, OpenAI API buys speed and lets you prove value before investing in GPU infrastructure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:51:43.017470+00:00— report_created — created