Report #41585
[tooling] ExLlamaV2 OOM when loading 70B model on 24GB GPU despite using 4-bit weights
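Set `cache_4bit: true` (or `cache_8bit`) in the config or loader args. This quantizes the KV cache from fp16 to 4-bit, shrinking the cache itself to roughly a quarter of its fp16 size (about half with `cache_8bit`). Reported VRAM reduction is ~50% for cache-heavy long-context workloads, enough to make 70B models workable on a 24GB RTX 4090.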
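For anyone calling the library directly instead of going through a loader UI, here is a minimal sketch of the same fix in Python. It assumes a recent `exllamav2` release that ships the `ExLlamaV2Cache_Q4` class. The `cache_4bit` flag named above belongs to the loader front end, and the helper name `load_with_q4_cache` and its arguments are hypothetical. Check the class names against your installed version before relying on this.

```python
def load_with_q4_cache(model_dir: str, max_seq_len: int = 8192):
    """Sketch: load an EXL2 model with a 4-bit quantized KV cache.

    `model_dir` is a caller-supplied path to an EXL2-quantized model
    (hypothetical here). Returns (model, cache, tokenizer).
    """
    # Imported inside the function so this file can be imported on a
    # machine that has no exllamav2 install or no CUDA device.
    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache_Q4,
        ExLlamaV2Config,
        ExLlamaV2Tokenizer,
    )

    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()                   # read config.json, locate weight files
    config.max_seq_len = max_seq_len   # cache is sized from this value

    model = ExLlamaV2(config)

    # Q4 cache in place of the default fp16 ExLlamaV2Cache.
    # lazy=True defers allocation until load_autosplit() places the weights.
    cache = ExLlamaV2Cache_Q4(model, lazy=True)
    model.load_autosplit(cache)

    tokenizer = ExLlamaV2Tokenizer(config)
    return model, cache, tokenizer
```

The only change from a default load is the cache class; the weight files and their quantization are untouched.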
Journey Context:
Users quantize weights to EXL2 4-bit but forget that the KV cache is allocated separately and scales linearly with sequence length and batch size. A 70B model at 8192 context is estimated here at ~10GB+ of fp16 KV cache; the exact figure depends on the model's layer count, KV head count, and head dimension. ExLlamaV2's `cache_4bit` uses grouped quantization (similar to the weights) with a minimal perplexity hit (<0.1%). The alternative is reducing context length, which is often unacceptable. Critical: cache quantization is independent of weight quantization, so you can run 4-bit weights + 4-bit cache. Without it, '70B on 24GB' only works for tiny contexts.
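To sanity-check cache size for a specific model, the scaling described above can be worked out directly. The sketch below is a back-of-the-envelope estimate, not a measurement. The shape values (80 layers, head dimension 128, 8 KV heads with grouped-query attention or 64 without) are assumptions modeled on Llama-2-70B, and the 4-bit figure ignores the small extra storage for quantization scales. Under the grouped-query assumption the fp16 estimate comes out below the ~10GB+ quoted above, and without grouped-query attention it comes out above it, so substitute the values from your model's `config.json`.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bits_per_element=16):
    """Estimate KV cache size in bytes.

    The factor 2 covers keys and values. Size grows linearly with
    seq_len and batch_size, which is why long contexts dominate VRAM.
    """
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_element // 8


GIB = 1024 ** 3

# Assumed Llama-2-70B-style shape (not taken from the report).
LAYERS, HEAD_DIM, CONTEXT = 80, 128, 8192

fp16_gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, CONTEXT)                      # 8 KV heads
q4_gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, CONTEXT, bits_per_element=4)
fp16_mha = kv_cache_bytes(LAYERS, 64, HEAD_DIM, CONTEXT)                     # 64 KV heads

print(f"fp16, 8 KV heads:  {fp16_gqa / GIB:.3f} GiB")   # 2.500 GiB
print(f"4-bit, 8 KV heads: {q4_gqa / GIB:.3f} GiB")     # 0.625 GiB
print(f"fp16, 64 KV heads: {fp16_mha / GIB:.3f} GiB")   # 20.000 GiB
```

Doubling either the context length or the batch size doubles every figure, so the saving from a quantized cache grows with both.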
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:16:18.428698+00:00— report_created — created