Report #8397
[tooling] ExLlamaV2 running out of VRAM on long contexts despite fitting model weights
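Enable KV cache quantization in ExLlamaV2. In the Python API this means allocating the cache with ExLlamaV2Cache_Q4 (4-bit) or ExLlamaV2Cache_8bit (8-bit) instead of the default FP16 ExLlamaV2Cache. Option names such as cache_q4=True or cache_8bit=True belong to front ends that wrap ExLlamaV2, and the exact spelling varies by tool. Relative to an FP16 cache, the 8-bit cache uses roughly 50% less VRAM and the 4-bit cache roughly 70-75% less at long contexts, usually with only a small perplexity impact.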
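A minimal sketch of selecting a quantized cache through the ExLlamaV2 Python API. The helper name build_cache, the mode strings, and the default max_seq_len are hypothetical and chosen only for illustration. The cache class names and the (model, max_seq_len=..., lazy=...) arguments reflect the exllamav2 package as I understand it, so check them against your installed version.

```python
# Maps an illustrative mode string to the exllamav2 cache class name.
# The mode strings are hypothetical; the class names are assumed to match
# the installed exllamav2 package.
CACHE_CLASS_NAMES = {
    "fp16": "ExLlamaV2Cache",        # default, unquantized cache
    "8bit": "ExLlamaV2Cache_8bit",   # 8-bit cache
    "q4": "ExLlamaV2Cache_Q4",       # 4-bit quantized cache
}


def build_cache(model, mode="q4", max_seq_len=32768):
    """Allocate a KV cache of the requested precision for a loaded-or-lazy model."""
    if mode not in CACHE_CLASS_NAMES:
        raise ValueError(f"unknown cache mode: {mode!r}")
    # Imported lazily so the mapping can be inspected without the library.
    import exllamav2

    cache_cls = getattr(exllamav2, CACHE_CLASS_NAMES[mode])
    # lazy=True defers allocation until the model is loaded with autosplit.
    return cache_cls(model, max_seq_len=max_seq_len, lazy=True)
```

With a lazily created cache, the usual pattern is to pass it to model.load_autosplit(cache) so the weights and the cache are placed across the available GPUs together.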
Journey Context:
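Most users quantize the model weights (Q4/Q8) but forget that the KV cache grows linearly with context length: cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value. The factor 2 covers keys and values, and num_kv_heads is the number of key/value heads, which is smaller than the attention head count in grouped-query models. For 70B models at 128k context, the FP16 cache alone can exceed 40GB. ExLlamaV2 supports quantizing this cache to 4-bit (Q4) or 8-bit, trading a small amount of precision for large VRAM savings, which can make long contexts practical on 24GB consumer cards when the model weights leave enough headroom.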
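The cache-size arithmetic can be checked with a short calculation. This is a rough estimate, not ExLlamaV2's own accounting. It counts KV heads rather than attention heads, assumes a Llama-style 70B shape (80 layers, 8 KV heads, head dimension 128, which are not stated in the report), and ignores the small overhead of quantization scales.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value=16.0):
    """Estimate KV cache size in bytes; the factor 2 covers keys and values."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


# Assumed Llama-style 70B shape at a 128k (131072-token) context.
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=131072)

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size_gib = kv_cache_bytes(bits_per_value=bits, **shape) / GIB
    print(f"{label}: {size_gib:.1f} GiB")
# FP16: 40.0 GiB
# 8-bit: 20.0 GiB
# 4-bit: 10.0 GiB
```

Under these assumptions the 4-bit cache is about a quarter of the FP16 size. The real saving is slightly smaller once per-block scales are stored alongside the quantized values.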
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:21:30.595473+00:00— report_created — created