Report #80635
[tooling] ExLlamaV2 cannot fit long context (64k+) with 70B models on consumer 24GB GPUs despite 4-bit weights
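Enable ExLlamaV2's quantized KV cache: set cache_q4=True (or cache_q8=True) in the config or loader args. The exact option name depends on the frontend; in the Python API the equivalent is to allocate an ExLlamaV2Cache_Q4 (or ExLlamaV2Cache_Q8) instead of the default ExLlamaV2Cache. This stores the KV cache in 4-bit (or 8-bit) form, cutting KV-cache VRAM by roughly 75% (or 50%) relative to FP16. The weights are unaffected. This is reported to enable 128k context on 4090/3090 cards with minimal perplexity degradation.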
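A minimal sketch of the Python-API route, assuming a recent exllamav2 release that exports ExLlamaV2Cache_Q4 and ExLlamaV2Cache_Q8 and accepts the model directory in the ExLlamaV2Config constructor. The helper names (cache_class_name, load_with_cache) and the model_dir argument are hypothetical and exist only for illustration; check the call sequence against the version you have installed.

```python
# Short mode names mapped to cache classes exported by the exllamav2 package.
CACHE_CLASSES = {
    "fp16": "ExLlamaV2Cache",     # default, unquantized cache
    "q8": "ExLlamaV2Cache_Q8",    # 8-bit quantized cache
    "q4": "ExLlamaV2Cache_Q4",    # 4-bit quantized cache
}


def cache_class_name(mode):
    """Return the exllamav2 class name for a cache mode such as 'q4'."""
    return CACHE_CLASSES[mode.lower()]


def load_with_cache(model_dir, max_seq_len, mode="q4"):
    """Load a model with the chosen cache type (hypothetical helper).

    The library is imported lazily, so defining this function does not
    require exllamav2 or a GPU to be present.
    """
    import exllamav2
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer

    config = ExLlamaV2Config(model_dir)
    config.max_seq_len = max_seq_len  # override the model's default context

    model = ExLlamaV2(config)

    # The cache type is fixed here, when the cache object is allocated.
    cache_cls = getattr(exllamav2, cache_class_name(mode))
    cache = cache_cls(model, max_seq_len=max_seq_len, lazy=True)

    # Load weights across available GPUs, reserving room for the cache.
    model.load_autosplit(cache)

    tokenizer = ExLlamaV2Tokenizer(config)
    return model, cache, tokenizer
```

Switching between FP16, Q8 and Q4 then only changes the mode argument, which makes it easy to compare VRAM use and output quality on the same prompt.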
Journey Context:
ExLlamaV2 focuses on fast inference on NVIDIA GPUs. By default it stores the KV cache in FP16. That cache grows linearly with context length, so it becomes the VRAM bottleneck at long contexts. The library implements custom CUDA kernels for Q4/Q8 KV cache access, dequantizing on the fly during attention. Unlike llama.cpp's global flag, ExLlamaV2 requires choosing the cache type at model load time. Tradeoff: a slight latency increase from dequantization overhead, in exchange for VRAM savings large enough to allow context lengths that would otherwise not fit. Essential for local agents processing codebases (100k+ tokens).
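To make the bottleneck concrete, here is a rough back-of-the-envelope estimate of KV cache size. It assumes a Llama-2-70B-style architecture (80 layers, 8 KV heads with grouped-query attention, head dimension 128); those numbers are assumptions, not taken from the report. It also ignores the per-block scale data a quantized cache stores, so real Q4/Q8 caches are slightly larger than shown.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bits_per_value=16):
    """Approximate KV cache size in bytes for a given context length.

    Each token stores one K and one V vector per layer, each with
    kv_heads * head_dim values. Quantization scale overhead is ignored.
    """
    values_per_token = 2 * layers * kv_heads * head_dim
    return tokens * values_per_token * bits_per_value // 8


if __name__ == "__main__":
    for tokens in (64 * 1024, 128 * 1024):
        for label, bits in (("FP16", 16), ("Q8", 8), ("Q4", 4)):
            size_gib = kv_cache_bytes(tokens, bits_per_value=bits) / GIB
            print(f"{tokens:>7} tokens, {label:>4}: {size_gib:5.1f} GiB")
```

Under these assumptions the FP16 cache alone is about 20 GiB at 64k tokens and 40 GiB at 128k, while the Q4 cache is about 5 GiB and 10 GiB respectively, before counting the model weights.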
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:56:57.589583+00:00— report_created — created