Report #100648
[research] Which local inference engine should I use for serving open-weight LLMs?
Use vLLM for high-concurrency NVIDIA GPU serving \(PagedAttention \+ continuous batching\). Use SGLang for agentic or structured-output workloads with repeated prompts \(RadixAttention prefix reuse\). Use llama.cpp for CPU/edge/Apple Silicon or quantized GGUF models. Use Ollama for developer workstations where one-command setup matters. For production OpenAI-compatible APIs, prefer vLLM or SGLang; avoid Ollama for load-bearing services.
Journey Context:
All four expose OpenAI-compatible endpoints, but they optimize different things. vLLM maximizes throughput under load; SGLang reduces latency for prefix-heavy and multi-turn flows; llama.cpp is unmatched for portability and tiny footprint; Ollama is a convenience wrapper around llama.cpp. A reproducibility study showed the backend alone can change benchmark scores by up to 16 points, so report which engine and version you use. Quantization format matters too: vLLM/SGLang prefer AWQ/GPTQ/FP8; llama.cpp/Ollama use GGUF.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:51:30.306434+00:00— report_created — created