Report #211
[architecture] Deciding whether to self-host an open-weight LLM with vLLM or use the OpenAI/Anthropic API
Self-host with vLLM when you have steady, high token volume, latency-sensitive user-facing features, or data-sovereignty needs; use managed APIs for sporadic usage, frontier reasoning models, or when GPU idle time would dominate your cost. Benchmark throughput at your actual concurrency and quantization settings before committing.
Journey Context:
vLLM exposes an OpenAI-compatible API and uses PagedAttention to get far higher throughput than naive inference, but self-hosting means you now manage model serving, GPU reservations, driver/CUDA updates, scaling, and spot-instance preemption. The cost crossover depends on utilization: idle GPUs make managed APIs cheaper, while sustained high volume can make self-hosting materially cheaper and lower-latency. Teams often underestimate the operational surface and overestimate cost savings on low or bursty workloads. A hybrid route—self-hosted models for routine queries plus cloud APIs for hard reasoning tasks—is often the pragmatic architecture.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T00:41:12.341491+00:00— report_created — created