Report #551
[tooling] NVIDIA single-GPU inference with GGUF feels slower than it should be
For single-GPU NVIDIA inference, use ExLlamaV2 (usually behind TabbyAPI) with an EXL2-quantized model. Target ~4.0 bpw for 24GB cards and 5.0 bpw for 48GB, provided the weights (roughly parameter count times bpw / 8 bytes) still leave room for the KV cache, and enable cache quantization to fit long contexts. For dense models it typically beats llama.cpp's CUDA backend on raw decode tokens per second.
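A rough way to sanity-check a bpw target against a card's VRAM is to estimate the weight and KV-cache footprints separately. The sketch below is only back-of-the-envelope arithmetic: it ignores activations, quantization scales, and framework overhead, and the 70B shape used in the example (80 layers, 8 KV heads, head dimension 128) is an assumption based on Llama-style 70B models, not something stated in this report.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(params_billion: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bpw / 8 / GIB


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bits_per_element: float) -> float:
    """Approximate KV-cache size in GiB for one sequence.

    Keys and values each store layers * kv_heads * head_dim elements per token.
    """
    elements = 2 * layers * kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8 / GIB


if __name__ == "__main__":
    # Assumed Llama-style 70B shape; adjust for the actual model config.
    shape = dict(layers=80, kv_heads=8, head_dim=128, context_tokens=32768)
    for bpw in (2.5, 4.0, 5.0):
        print(f"70B weights at {bpw} bpw: {weight_gib(70, bpw):.1f} GiB")
    print(f"32k cache at FP16: {kv_cache_gib(**shape, bits_per_element=16):.1f} GiB")
    print(f"32k cache at 4-bit: {kv_cache_gib(**shape, bits_per_element=4):.1f} GiB")
```

Under these assumptions the 32k-token cache shrinks from about 10 GiB at FP16 to about 2.5 GiB at 4 bits, which is why cache quantization matters for long contexts on a single card.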
Journey Context:
llama.cpp optimizes for portability across CPUs, Metal, and GPUs, while ExLlamaV2 is a narrow, hand-optimized CUDA path for GPTQ/EXL2. The trade-off is ecosystem: ExLlamaV2 is NVIDIA-only, single-user-first, and needs Flash Attention 2. If you serve concurrent requests or run on Apple/AMD, stay with llama.cpp. For a lone RTX 4090, EXL2 is the practical way to get usable speed without sharding, but size the bitrate to the model: a 70B model at 4.0 bpw is about 35 GB of weights and does not fit in 24 GB, so a 70B chat model on that card needs roughly 2.5 bpw or lower, while 4.0 bpw suits models up to about the 30B class. Calibrate with a domain-specific calibration dataset in Parquet format and keep head bits high (`-hb 6+`); the remaining layers are auto-mixed at 2–8 bits to hit the target bpw with minimal perplexity impact.
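As an illustration of the calibration step, the sketch below assembles a command line for ExLlamaV2's `convert.py`. Only `-hb` comes from this report; the other flags (`-i`, `-o`, `-cf`, `-c`, `-b`) are my understanding of the script's options and should be checked against `python convert.py --help` for the installed version. All paths are hypothetical placeholders, and the helper name `build_convert_command` is made up for this example.

```python
import shlex


def build_convert_command(model_dir: str, work_dir: str, out_dir: str,
                          calibration_parquet: str,
                          bpw: float = 4.0, head_bits: int = 6) -> list[str]:
    """Build an argv list for an EXL2 conversion (flags assumed, see lead-in)."""
    return [
        "python", "convert.py",
        "-i", model_dir,            # assumed: input FP16 model directory
        "-o", work_dir,             # assumed: scratch/working directory
        "-cf", out_dir,             # assumed: directory for the finished model
        "-c", calibration_parquet,  # assumed: calibration dataset (Parquet)
        "-b", str(bpw),             # assumed: target average bits per weight
        "-hb", str(head_bits),      # head bits, as recommended in the report
    ]


if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    cmd = build_convert_command(
        "models/base-fp16",
        "work/exl2-tmp",
        "models/base-exl2-4.0bpw",
        "data/domain_calibration.parquet",
    )
    print(shlex.join(cmd))
```

Building the argument list in code rather than by hand makes it easier to sweep several bpw targets from the same calibration file and compare the results.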
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:53:23.121562+00:00— report_created — created