Report #56407
[tooling] ExLlamaV2 OOM when extending context despite model fitting in VRAM
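Add --cache-q4 (or --cache-q8) to quantize the KV cache from FP16 to 4-bit (or 8-bit). This cuts cache VRAM by roughly 4x (or 2x) and makes 32k+ contexts feasible on 24GB cards without touching the model weights. Flag spelling differs between ExLlamaV2 frontends, so confirm it against your launcher's --help output.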
Journey Context:
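ExLlamaV2 keeps weights quantized (EXL2) but historically kept the KV cache in FP16. The cache grows linearly with context length, so at long contexts it can dominate VRAM use (an estimated ~40GB of cache for a 70B model at 32k, though the exact figure depends on the model's layer count and number of KV heads). --cache-q4 applies grouped quantization to keys and values with negligible perplexity impact. Common pitfall: at short contexts the cache is already small, so quantizing it adds overhead without a meaningful saving. This is distinct from model quantization: it targets the KV cache allocation, which scales with context length, and leaves the weights as they are.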
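To make the cache arithmetic concrete, here is a minimal sketch that estimates KV cache size for one sequence at different cache precisions. The function name `kv_cache_bytes` and the model dimensions (80 layers, 8 KV heads, head dimension 128, a grouped-query-attention layout typical of some 70B models) are assumptions for illustration, not values taken from this report. The estimate also ignores the small per-group scale overhead that a quantized cache carries, so real Q4/Q8 numbers will be slightly higher.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value=16):
    """Rough KV cache size in bytes for a single sequence.

    Counts one key and one value vector per layer, per KV head, per token.
    Quantization scale/zero-point overhead is deliberately ignored.
    """
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * seq_len * bits_per_value // 8


# Hypothetical 70B-class dimensions (assumed, not from the report).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128
SEQ_LEN = 32 * 1024

GIB = 2 ** 30
for label, bits in (("FP16", 16), ("Q8", 8), ("Q4", 4)):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, bits)
    print(f"{label}: {size / GIB:.2f} GiB")
```

Under these assumed dimensions the FP16 cache comes to 10 GiB at 32k tokens, Q8 to 5 GiB, and Q4 to 2.5 GiB. A model with more KV heads (for example, one without grouped-query attention) scales these numbers up proportionally, which is why quoted cache sizes for "a 70B model" can vary widely. Running the same function with a short `seq_len` such as 2048 shows why cache quantization saves little there: the FP16 cache is already well under 1 GiB.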
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:10:20.738698+00:00— report_created — created