Report #1115
[tooling] Long-context inference with ExLlamaV2 runs out of VRAM even though the weights fit
Swap the default ExLlamaV2Cache for ExLlamaV2Cache_Q4 when loading: cache = ExLlamaV2Cache_Q4(model, max_seq_len=65536, lazy=True). This cuts the KV-cache memory footprint to roughly one quarter of FP16, with measured perplexity within noise of FP16.
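For context, here is a minimal sketch of where that cache line might sit in a loading routine. The helper name `load_with_q4_cache` and the `model_dir` argument are hypothetical. The sketch assumes the `exllamav2` package is installed, a CUDA GPU is available, and the model directory holds a supported model. The import is deferred into the function so the definition can be pasted without the package present.

```python
def load_with_q4_cache(model_dir, max_seq_len=65536):
    """Load a model with a lazily allocated Q4 KV cache (illustrative helper)."""
    # Deferred import: requires the exllamav2 package and a CUDA-capable GPU.
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

    config = ExLlamaV2Config()
    config.model_dir = model_dir      # hypothetical path supplied by the caller
    config.prepare()                  # read the model's config files
    config.max_seq_len = max_seq_len  # raise the context limit to match the cache

    model = ExLlamaV2(config)
    # lazy=True defers cache allocation until the weights are being placed
    cache = ExLlamaV2Cache_Q4(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache)       # load weights, allocating the cache alongside
    return model, cache
```

Calling `load_with_q4_cache("/path/to/model")` would return the model and its cache. Whether a 65536-token cache fits still depends on the model and the card, so treat the default as a starting point.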
Journey Context:
The KV cache grows linearly with sequence length and, depending on the model, can exceed the size of the model weights beyond roughly 8–16K tokens. ExLlamaV2 provides FP16, FP8, Q8, Q6, and Q4 cache classes. The Q4 mode applies Hadamard rotations to keys and values and, counterintuitively, often outperforms FP8 while using half the memory. For long-context summarization or multi-turn chat on a single 24 GB card, the Q4 cache is usually the right tradeoff; reserve FP16 for precision-critical evaluation.
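The linear growth can be checked with a back-of-the-envelope estimate. The sketch below is not ExLlamaV2 code. It assumes illustrative model dimensions (32 layers, 8 KV heads, head dimension 128) and treats Q4 as exactly 4 bits per element, ignoring the quantization scales that a real Q4 cache also stores, so actual usage will be somewhat higher.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_element, batch_size=1):
    """Estimate KV-cache size in bytes for one cache precision."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_element // 8

GIB = 2 ** 30

# Illustrative dimensions (assumed, not taken from the report).
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=65536)

fp16_bytes = kv_cache_bytes(bits_per_element=16, **dims)
q4_bytes = kv_cache_bytes(bits_per_element=4, **dims)  # scales ignored

print(f"FP16 cache: {fp16_bytes / GIB:.1f} GiB")
print(f"Q4 cache:   {q4_bytes / GIB:.1f} GiB")
```

Under these assumptions the FP16 cache at 65536 tokens comes to 8 GiB and the idealized Q4 cache to 2 GiB, which is where the "roughly one quarter" figure comes from.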
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T17:56:11.555742+00:00— report_created — created