Report #1062
[architecture] LLM serving: self-host with vLLM vs OpenAI API
Self-host with vLLM when you have steady, high-volume traffic and can amortize GPU cost. Use the OpenAI API for variable or spiky workloads, quick starts, or when operational overhead outweighs savings.
Journey Context:
vLLM uses PagedAttention and continuous batching to get substantially higher throughput from a single GPU than naive Hugging Face Transformers serving, and it exposes an OpenAI-compatible API, so existing client code can usually be pointed at it with little change. The catch is that you own uptime, scaling, model updates, and hardware. OpenAI and other hosted, OpenAI-compatible APIs eliminate infrastructure work and scale on demand, but total spend grows linearly with token volume, and you are bound by the provider's rate limits and model availability. The break-even point is roughly where monthly API token spend exceeds the cost of reserved GPU capacity plus the fraction of an engineer's time needed to run it.
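The break-even rule above can be sketched as a small calculation. This is a rough illustration, not a pricing model: every name and number below (ApiPricing, SelfHostCost, the dollar figures, the engineer fraction) is a hypothetical placeholder, and the sketch assumes the reserved GPU capacity can actually serve the stated token volume. Substitute your own provider pricing and infrastructure quotes before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass
class ApiPricing:
    # Blended input/output price per million tokens (hypothetical figure).
    usd_per_million_tokens: float


@dataclass
class SelfHostCost:
    gpu_usd_per_month: float       # reserved GPU capacity (hypothetical)
    engineer_usd_per_month: float  # fully loaded engineer cost (hypothetical)
    engineer_fraction: float       # share of that engineer's time spent on serving

    def monthly_total(self) -> float:
        """GPU cost plus the fractional engineer cost for one month."""
        return self.gpu_usd_per_month + self.engineer_usd_per_month * self.engineer_fraction


def api_monthly_cost(tokens_per_month: float, pricing: ApiPricing) -> float:
    """API spend scales linearly with token volume."""
    return tokens_per_month / 1_000_000 * pricing.usd_per_million_tokens


def break_even_tokens(pricing: ApiPricing, hosting: SelfHostCost) -> float:
    """Monthly token volume at which API spend equals self-hosting cost."""
    return hosting.monthly_total() / pricing.usd_per_million_tokens * 1_000_000


def recommend(tokens_per_month: float, pricing: ApiPricing, hosting: SelfHostCost) -> str:
    """Return 'self-host' only when API spend exceeds the self-hosting total."""
    if api_monthly_cost(tokens_per_month, pricing) > hosting.monthly_total():
        return "self-host"
    return "api"


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    pricing = ApiPricing(usd_per_million_tokens=6.0)
    hosting = SelfHostCost(
        gpu_usd_per_month=2_000.0,
        engineer_usd_per_month=16_000.0,
        engineer_fraction=0.25,
    )
    for volume in (200_000_000, 5_000_000_000):
        print(volume, recommend(volume, pricing, hosting))
    print("break-even tokens/month:", break_even_tokens(pricing, hosting))
```

The comparison is deliberately strict: at exactly the break-even volume the sketch still returns "api", since self-hosting carries operational risk that this simple model does not price in. Spiky traffic also weakens the case for self-hosting, because reserved GPUs are paid for whether or not they are busy.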
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:57:44.897756+00:00— report_created — created