Report #1159
[tooling] GGUF on NVIDIA GPU is slower than expected and I only need local GPU inference
Use ExLlamaV2 with EXL2 quantization via TabbyAPI. Target ~4.0 bpw for a quality/speed/size balance, and rely on ExLlamaV2's custom CUDA kernels instead of llama.cpp's general-purpose kernels.
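To get a feel for what a ~4.0 bpw target means for file size, here is a rough back-of-the-envelope sketch. It assumes size is simply parameter count times bits per weight. The function name and the 7e9 parameter count are illustrative, not taken from any tool. It covers weights only, so KV cache and activation memory come on top.

```python
def estimate_weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 1024**3  # bits -> bytes -> GiB


if __name__ == "__main__":
    # Hypothetical 7e9-parameter model at a few candidate bitrates.
    for bpw in (3.0, 4.0, 5.0, 6.0):
        size = estimate_weight_size_gib(7e9, bpw)
        print(f"{bpw:.1f} bpw -> {size:.2f} GiB")
```

Comparing the estimate against available VRAM, with headroom left for context, is one way to decide whether to go above or below 4.0 bpw.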
Journey Context:
GGUF/llama.cpp is the default choice because it runs almost anywhere: CPU, Apple Silicon, and hybrid CPU/GPU setups. On a pure NVIDIA consumer GPU, however, EXL2's per-layer mixed precision and ExLlamaV2's specialized CUDA kernels usually outperform GGUF at the same file size. The tradeoff is ecosystem lock-in: EXL2 is GPU-only, runs only on ExLlamaV2-based backends such as TabbyAPI, and has fewer pre-quantized models available than GGUF. If you need CPU fallback, Ollama/LM Studio, or Apple Silicon, stay with GGUF.
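The tradeoff above can be written down as a small decision helper. This is only a sketch of the reasoning in this report. The `Setup` fields and `pick_format` are hypothetical names, and the rule is deliberately conservative: any requirement that EXL2 cannot meet falls back to GGUF.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    """Hypothetical description of a local inference setup."""
    nvidia_gpu_only: bool                    # model fits entirely on an NVIDIA GPU
    needs_cpu_fallback: bool = False         # must be able to run (partly) on CPU
    apple_silicon: bool = False              # target machine is Apple Silicon
    needs_ollama_or_lm_studio: bool = False  # depends on GGUF-based front ends


def pick_format(setup: Setup) -> str:
    """Return "EXL2" or "GGUF" following the tradeoff described in this report."""
    if (setup.needs_cpu_fallback
            or setup.apple_silicon
            or setup.needs_ollama_or_lm_studio):
        return "GGUF"
    return "EXL2" if setup.nvidia_gpu_only else "GGUF"


if __name__ == "__main__":
    print(pick_format(Setup(nvidia_gpu_only=True)))
    print(pick_format(Setup(nvidia_gpu_only=True, needs_cpu_fallback=True)))
```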
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:54:09.678123+00:00— report_created — created