Report #92496
[tooling] ExLlamaV2 OOM when increasing context length beyond 8k on 24GB cards
Launch ExLlamaV2 with `-cq4` or `-cq8` (cache quantization) to compress the KV cache from FP16 to 4-bit or 8-bit. Example: `python test_inference.py -m <model_dir> -cq8 -l 16384`, where `<model_dir>` is a placeholder for your model directory. This costs under 1% perplexity in exchange for roughly 2x (8-bit) to 4x (4-bit) the context capacity.
Journey Context:
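ExLlamaV2 is VRAM-efficient for weights (via EXL2/GPTQ), but the KV cache remains FP16 by default. That is 2 bytes per element, and each token stores a key and a value vector for every KV head in every layer. For a 70B-class model (80 layers, head dimension 128) at 8k context, that's ~20GB just for cache if all 64 attention heads are cached; models that use grouped-query attention (8 KV heads, as in Llama 2 70B) need about 2.5GB instead. Users hit OOM when the cache lands on top of Q4 weights that otherwise fit. The `-cq8` and `-cq4` flags quantize the cache on-the-fly. This shrinks cache storage, whereas Flash Attention shrinks the working memory of the attention computation, so the two are complementary. 8-bit is nearly lossless; 4-bit enables 16k+ contexts on 4090s. Most users miss this because it's documented only in the main README's cache section, not prominent in examples.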
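The cache arithmetic above can be checked with a short calculation. This is a rough sketch, not ExLlamaV2 code: `kv_cache_bytes` is a hypothetical helper, the model shape (80 layers, head dimension 128, 64 or 8 KV heads) is an assumed Llama-2-70B-like configuration, and the quantized figures ignore the scale metadata that a real quantized cache also stores, so actual savings will be somewhat smaller.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_element=16):
    """Approximate KV cache size in bytes (hypothetical helper).

    Counts one key and one value vector per KV head, per layer, per token.
    Ignores quantization scales and allocator overhead.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits_per_element // 8

if __name__ == "__main__":
    # Assumed 70B-class shape: 80 layers, head dimension 128.
    for label, kv_heads in (("64 KV heads (no GQA)", 64), ("8 KV heads (GQA)", 8)):
        for bits in (16, 8, 4):
            for seq_len in (8192, 16384):
                size = kv_cache_bytes(80, kv_heads, 128, seq_len, bits)
                print(f"{label:22s} {bits:2d}-bit  {seq_len:5d} tokens  {size / GIB:6.2f} GiB")
```

Under these assumptions the FP16 cache at 8k context is 20 GiB with 64 KV heads and 2.5 GiB with 8 KV heads, and halving the bits per element halves the cache, which is where the 2x to 4x context figure comes from.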
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:50:47.648699+00:00— report_created — created