Report #5433
[tooling] How do I determine the optimal GPTQ group size for ExLlamaV2 that fits my VRAM without OOM?
Run `python measure_quant.py --model --bits 4 --groupsize` before full conversion to measure per-layer VRAM usage and latency; iterate group sizes (32, 64, 128) to find the Pareto frontier for your specific card (e.g., 4090 vs A100).
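Once you have one VRAM and latency figure per group size, picking the candidates can be sketched as below. This is a minimal illustration, not part of ExLlamaV2: `Measurement`, `pareto_frontier`, and `fits_budget` are hypothetical names, and the sample numbers are placeholders rather than real measurements.

```python
from typing import List, NamedTuple


class Measurement(NamedTuple):
    """One measured configuration (hypothetical record layout)."""
    group_size: int
    vram_gib: float
    latency_ms: float


def pareto_frontier(points: List[Measurement]) -> List[Measurement]:
    """Keep configurations that no other configuration beats on both
    VRAM and latency (lower is better for both)."""
    frontier = []
    for p in points:
        dominated = any(
            q.vram_gib <= p.vram_gib
            and q.latency_ms <= p.latency_ms
            and (q.vram_gib < p.vram_gib or q.latency_ms < p.latency_ms)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda m: m.vram_gib)


def fits_budget(points: List[Measurement], budget_gib: float) -> List[Measurement]:
    """Drop configurations whose measured VRAM exceeds the card's budget."""
    return [p for p in points if p.vram_gib <= budget_gib]


# Placeholder values for illustration only; substitute your own measurements.
samples = [
    Measurement(group_size=128, vram_gib=7.1, latency_ms=21.0),
    Measurement(group_size=64, vram_gib=7.4, latency_ms=19.5),
    Measurement(group_size=32, vram_gib=7.9, latency_ms=22.0),
]

candidates = pareto_frontier(fits_budget(samples, budget_gib=24.0))
for m in candidates:
    print(m.group_size, m.vram_gib, m.latency_ms)
```

This frontier only weighs VRAM against latency. Accuracy is a third axis, so a dominated configuration such as a smaller group size may still be worth keeping if quality matters more than speed.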
Journey Context:
Most users blindly use groupsize=128 (the default) or copy settings from TheBloke without verifying whether their target hardware benefits. ExLlamaV2's `measure_quant.py` benchmarks the actual kernels (GEMV vs GEMM) on your specific CUDA device. This can reveal that some layers run faster with g=64 despite the higher memory use, due to better kernel alignment, or that g=32 causes OOM on 24GB cards for 70B models while g=128 fits. Tradeoff: a smaller group size gives higher accuracy but uses more VRAM, because every group stores its own scale and zero point; measuring replaces guessing. Common mistake: confusing this with the Triton vs CUDA backend choice in GPTQ-for-LLaMa, which has different performance characteristics.
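The memory side of this tradeoff can be estimated before measuring anything. The sketch below assumes a common GPTQ layout in which each group stores one fp16 scale and one zero point packed at the weight bit width. It ignores activation ordering indices, unquantized layers, the KV cache, and activations, so treat it as a lower bound on weight memory only. The function names and the 13e9 parameter count are illustrative assumptions.

```python
from typing import Optional


def effective_bits_per_weight(
    bits: int,
    group_size: int,
    scale_bits: int = 16,             # assumption: fp16 scale per group
    zero_bits: Optional[int] = None,  # assumption: zero point packed at `bits`
) -> float:
    """Average storage per weight once per-group metadata is included."""
    if zero_bits is None:
        zero_bits = bits
    return bits + (scale_bits + zero_bits) / group_size


def weight_gib(n_params: float, bits: int, group_size: int) -> float:
    """Rough size of the quantized weights alone, in GiB."""
    total_bits = n_params * effective_bits_per_weight(bits, group_size)
    return total_bits / 8 / 2**30


for g in (32, 64, 128):
    bpw = effective_bits_per_weight(4, g)
    print(f"g={g}: {bpw} bits/weight, {weight_gib(13e9, 4, g):.2f} GiB for 13e9 params")
```

Under these assumptions, 4-bit weights cost 4.625, 4.3125, and 4.15625 bits per weight at g=32, 64, and 128 respectively, which is why smaller groups need more VRAM.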
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:15:59.914570+00:00— report_created — created