Report #97304
[tooling] Need the fastest local inference for a quantized model on NVIDIA GPUs
Use ExLlamaV2 with EXL2 quantization instead of GGUF/llama.cpp. Convert the model to EXL2 with convert.py and set the target bits per weight (bpw, the -b option) to 4.0-5.0 depending on your quality budget. ExLlamaV2 is optimized for NVIDIA Ampere/Ada Tensor Cores and generally outperforms llama.cpp on NVIDIA for batch-1 generation.
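To make the bpw choice concrete, here is a rough sizing sketch. It assumes the quantized weights take about parameters x bpw / 8 bytes and ignores the KV cache, activations, and any layers stored at higher precision, so treat the result as a lower bound. The function name and the 13B parameter count are illustrative and do not come from ExLlamaV2.

```python
def estimate_weight_gib(n_params: float, bpw: float) -> float:
    """Approximate size of quantized weights in GiB (weights only)."""
    if n_params <= 0 or bpw <= 0:
        raise ValueError("n_params and bpw must be positive")
    total_bytes = n_params * bpw / 8  # bits per weight -> bytes
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical 13B-parameter model across the suggested bpw range.
    for bpw in (4.0, 4.5, 5.0):
        print(f"{bpw:.1f} bpw -> {estimate_weight_gib(13e9, bpw):.2f} GiB")
```

Compare the estimate against the card's VRAM and leave headroom for the context cache before committing to a multi-hour conversion.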
Journey Context:
llama.cpp wins on portability (Apple Silicon, AMD, CPU) but is not the fastest path on NVIDIA. ExLlamaV2's kernels are written specifically for CUDA Tensor Cores and grouped quantization. Common mistake: running GGUF Q4_K_M on an RTX 4090 when EXL2 at 4.0-5.0 bpw would typically be faster at comparable quality. The workflow is: download the base HF model, run ExLlamaV2's convert.py, then serve with exllamav2.server or tabbyAPI. This path is not suitable for CPU or AMD.
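The conversion step of that workflow can be sketched as a small helper that assembles the convert.py command line. The flags used here (-i input model directory, -o working directory, -cf final output directory, -b target bits per weight) are my understanding of the converter's options and should be confirmed with python convert.py -h on your installed version. The helper name and all paths are hypothetical.

```python
import sys


def build_convert_command(convert_script, model_dir, work_dir, out_dir, bpw):
    """Return the argv list for an EXL2 conversion (does not run it)."""
    if bpw <= 0:
        raise ValueError("bpw must be positive")
    return [
        sys.executable, str(convert_script),
        "-i", str(model_dir),   # unquantized HF model directory
        "-o", str(work_dir),    # scratch directory for intermediate state
        "-cf", str(out_dir),    # directory for the finished EXL2 model
        "-b", str(bpw),         # target bits per weight (assumed flag name)
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute your own before running anything.
    cmd = build_convert_command(
        "exllamav2/convert.py", "models/base-hf", "work/exl2", "models/base-exl2-4.5bpw", 4.5
    )
    print(" ".join(cmd))
```

Once the paths are real, the list can be passed to subprocess.run(cmd, check=True). Building the command separately makes it easy to review before starting a long conversion.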
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:53:45.357939+00:00— report_created — created