Report #955
[tooling] Should I serve local LLMs with llama.cpp GGUF or ExLlamaV2 EXL2?
For dedicated NVIDIA GPUs, prefer ExLlamaV2 with EXL2 quants for higher throughput and longer contexts at the same effective bits-per-weight; use llama.cpp GGUF when you need CPU fallback, Apple Silicon, or one binary that works everywhere.
Journey Context:
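ExLlamaV2 is optimized specifically for NVIDIA GPUs: it uses fused CUDA kernels, an optional 8-bit (or lower) quantized KV cache, and paged attention, which together let it serve a 70B model at about 4 bpw with high batch throughput, provided the roughly 35 GB of weights plus the KV cache fit in VRAM (for example across two 24 GB cards or on a single 48 GB or 80 GB card). llama.cpp's strength is portability: one GGUF file runs on CUDA, Metal, Vulkan, and CPU. The mistake is defaulting to GGUF on a fast RTX 4090 or A100 and leaving speed on the table; this report puts the gap at around 30%, but the figure depends on model, quant, batch size, and software version, so benchmark on your own hardware. ExLlamaV2 requires a separate conversion of the model to EXL2 and has no Mac or CPU path, so choose it only when the deployment target is fixed NVIDIA hardware.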
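To sanity-check whether a "70B at 4 bpw" deployment fits a given GPU, a rough back-of-the-envelope estimate can help. The sketch below is illustrative only: the function names are hypothetical, and the architecture numbers (80 layers, 8 KV heads, head dimension 128) are assumptions for a Llama-family 70B model with grouped-query attention. It ignores activations, quantization scales, and framework overhead, so treat the results as lower bounds.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: int) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


if __name__ == "__main__":
    # Assumed shape for a Llama-family 70B model (not taken from this report).
    layers, kv_heads, head_dim = 80, 8, 128
    ctx = 8192

    weights = weight_bytes(70e9, 4.0)
    kv_fp16 = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 2)  # 16-bit cache
    kv_8bit = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 1)  # 8-bit cache

    print(f"weights:        {weights / GIB:.1f} GiB")
    print(f"KV cache fp16:  {kv_fp16 / GIB:.2f} GiB at {ctx} tokens")
    print(f"KV cache 8-bit: {kv_8bit / GIB:.2f} GiB at {ctx} tokens")
    print(f"total (8-bit):  {(weights + kv_8bit) / GIB:.1f} GiB")
```

Under these assumptions the weights alone exceed a single 24 GB card, which is why a 70B model at 4 bpw typically needs two consumer GPUs or one data-center GPU. An 8-bit cache halves the per-token KV cost compared with 16-bit, and that halving is where the extra context headroom comes from.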
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:52:43.449626+00:00— report_created — created