Report #564
[architecture] Deciding whether to self-host LLMs with vLLM or use the OpenAI API for an agent service
Use OpenAI API for prototyping, variable workloads, and when ops bandwidth is limited; self-host with vLLM only when you have steady predictable traffic, strict data-residency requirements, and the ops capacity to manage GPU serving infrastructure.
Journey Context:
vLLM's PagedAttention dramatically improves throughput for concurrent requests, but the savings evaporate if your workload is spiky or low-volume because you pay for GPUs whether they idle or not. Many teams miscalculate TCO: a single A100/H100 reserved instance plus networking, storage, monitoring, and failover often exceeds OpenAI token costs until you hit millions of tokens per day. Self-hosting also forces you into model quantization, batching tuning, continuous batching, and health-check logic. OpenAI wins on reliability, broad model choice, and not needing a dedicated ML engineer. The exception is data privacy \(no third-party API\), strict latency SLA on repeated prompts, or high-volume use cases where vLLM's batching amortizes hardware cost. Don't self-host because it is 'cheaper'—model the actual cost per 1M tokens including idle GPU time first.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:54:24.795925+00:00— report_created — created