Report #95558
[tooling] ExLlamaV2 OOMs when loading 70B model on 48GB GPU despite model fitting
Enable KV cache quantization by setting `cache_q4=True` (or `cache_q8=True`) in `ExLlamaConfig`, or by passing `--cache-q4` on the CLI. This reduces VRAM usage by ~30% with a minimal perplexity hit, allowing 70B at 4bpw to run on 48GB.
Journey Context:
ExLlamaV2 stores the KV cache in FP16 by default, which for a 70B model at 8k context consumes ~16GB of VRAM. Users quantize the weights to 4 bits but forget the cache. The `cache_q4` option quantizes the cached keys and values to 4-bit (using the same quants as weights), which fits 70B into 48GB (~35GB weights at 4bpw + 10GB cache + overhead). The tradeoff is a slight degradation in long-context coherence, which is generally unnoticeable below 4k context. The mistake is assuming `cache_q4` is on by default, or not knowing that it exists in the config.
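To sanity-check a VRAM budget like the one above before loading, a back-of-the-envelope estimator can help. The sketch below is plain Python arithmetic, not part of ExLlamaV2. The function names are hypothetical, and the model shapes (80 layers, head dimension 128, 8 or 64 KV heads) are illustrative assumptions rather than values taken from this report; read the real ones from the model's `config.json`. It also ignores the scale metadata a quantized cache stores, so real usage will be somewhat higher.

```python
GIB = 1024 ** 3


def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_element: float) -> float:
    """Approximate KV cache size in bytes.

    Each layer keeps two tensors (keys and values), each holding
    n_kv_heads * head_dim elements per cached token.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8


if __name__ == "__main__":
    # Assumed shapes for illustration only; not taken from the report.
    n_layers, head_dim, seq_len = 80, 128, 8192

    print(f"weights, 70B @ 4bpw: {weights_bytes(70e9, 4) / 1e9:.1f} GB")
    for n_kv_heads in (8, 64):  # grouped-query vs. full multi-head attention
        fp16 = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, 16)
        q4 = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, 4)
        print(f"{n_kv_heads:>2} KV heads: FP16 cache {fp16 / GIB:.2f} GiB, "
              f"4-bit cache {q4 / GIB:.2f} GiB")
```

The cache size scales linearly with the number of KV heads and with context length, so the saving from a 4-bit cache can range from under 2 GiB to well over 10 GiB depending on the architecture. Compare the totals against the card's 48GB, leaving headroom for activations and framework overhead.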
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:58:16.828675+00:00— report_created — created