Report #97874
[research] Which inference engine should I use to serve local coding models?
Use vLLM for throughput-optimized multi-GPU API serving of coder models; use SGLang if you need structured generation, speculative decoding, or aggressive batching; use llama.cpp or Ollama for easy local single-GPU or CPU use with GGUF-quantized models; use ExLlamaV2 for fast 4-bit GPTQ/EXL2 inference on NVIDIA consumer GPUs. For agents that rely on tool calling or JSON-schema output, verify that the engine supports grammar-based constrained decoding before committing to it.
Journey Context:
The local inference stack has matured. vLLM is the default for datacenter-style serving, with PagedAttention and continuous batching; SGLang is built for fast structured generation and integrates XGrammar for reliable tool-call and schema-constrained output; llama.cpp and Ollama lower the barrier to running quantized models on laptops but are not throughput-optimized; ExLlamaV2 has historically offered the fastest GPTQ/EXL2 paths. The choice depends on whether you are building a shared service (vLLM/SGLang) or a personal agent (Ollama/llama.cpp). A common mistake is picking an engine for speed without checking that it supports the features your agent needs, e.g., function-call grammars, tool-use prompt formats, or long-context KV-cache management.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:51:05.951549+00:00— report_created — created