Report #99757
[tooling] ExLlamaV2 runs out of VRAM when running long contexts on a consumer GPU
Replace the default FP16 cache with ExLlamaV2Cache_Q4: cache = ExLlamaV2Cache_Q4(model, max_seq_len=65536, lazy=True). It uses roughly a quarter of the memory of the FP16 cache and, per upstream evaluation, is often more accurate than the deprecated FP8 cache mode. Use ExLlamaV2Cache_Q8 if you need extra precision headroom.
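For context, here is a minimal sketch of where that cache line might sit in a loading script. The helper name load_with_q4_cache and the model_dir argument are hypothetical, and the surrounding calls (config.prepare(), load_autosplit, ExLlamaV2Tokenizer) follow the upstream examples as I understand them; check them against your installed exllamav2 version before relying on this.

```python
def load_with_q4_cache(model_dir, max_seq_len=65536):
    """Hypothetical helper: load a model with a Q4-quantized KV cache.

    Assumes the exllamav2 package is installed and model_dir points to a
    local model directory. The import is deferred so this module can be
    imported on machines without exllamav2.
    """
    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache_Q4,
        ExLlamaV2Config,
        ExLlamaV2Tokenizer,
    )

    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    # Assumption: raise the model's context limit to match the cache size.
    config.max_seq_len = max_seq_len

    model = ExLlamaV2(config)

    # lazy=True defers cache allocation until the weights are loaded,
    # which is what load_autosplit expects.
    cache = ExLlamaV2Cache_Q4(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache)

    tokenizer = ExLlamaV2Tokenizer(config)
    return model, cache, tokenizer
```

Swapping ExLlamaV2Cache_Q4 for ExLlamaV2Cache_Q8 in the import and the constructor call should be the only change needed to try the higher-precision cache.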
Journey Context:
ExLlamaV2 keeps the full KV cache in VRAM, so a 70B-class model at 32k tokens can easily exhaust a 24 GB card even with aggressive weight quantization. The Q4 cache applies Hadamard rotations to keys and values and stores them quantized, yielding perplexity within the noise floor of FP16. FP8 was the earlier attempt but is less accurate and no longer recommended; Q4 is the sweet spot for capacity, Q8 for safety.
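To make the memory pressure concrete, the following back-of-the-envelope estimate computes KV cache size at different cache precisions. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption chosen to resemble a Llama-2-70B-style architecture with grouped-query attention; it is not taken from this report. The estimate also ignores the per-group scale overhead of the quantized caches, so real Q4 and Q8 usage will be somewhat higher.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_element):
    """Approximate KV cache size in bytes.

    The factor of 2 covers keys and values, stored per layer and per token.
    """
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits_per_element // 8


# Assumed 70B-class shape (hypothetical, Llama-2-70B-like with GQA).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
SEQ_LEN = 32768
GIB = 1024 ** 3

fp16_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, 16) / GIB
q8_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, 8) / GIB
q4_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, 4) / GIB

print(f"FP16 cache: {fp16_gib:.1f} GiB")  # 10.0 GiB
print(f"Q8 cache:   {q8_gib:.1f} GiB")    # 5.0 GiB
print(f"Q4 cache:   {q4_gib:.1f} GiB")    # 2.5 GiB
```

Under these assumptions the FP16 cache alone takes about 10 GiB at 32k tokens, before any weights are loaded, which is why it does not fit alongside a 70B-class model on a 24 GB card.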
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:00:51.980932+00:00— report_created — created