Report #1029
[tooling] A 70B-class model fits in 24 GB VRAM but long contexts still OOM because the KV cache grows too large
In ExLlamaV2 or TabbyAPI, enable Q4 KV cache (cache_4bit=True / ExLlamaV2Cache_Q4). It typically cuts KV-cache VRAM by ~4x compared with FP16 with modest quality loss, making longer contexts fit on consumer cards.
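A minimal sketch of selecting the cache precision when loading through the ExLlamaV2 Python API directly. It assumes a recent ExLlamaV2 release that exports the `ExLlamaV2Cache_Q8`, `ExLlamaV2Cache_Q6`, and `ExLlamaV2Cache_Q4` classes. The helper names (`cache_class_name`, `load_with_cache`), the `model_dir` argument, and the 32768 default are hypothetical and only illustrate the idea; check them against your installed version before relying on this.

```python
# Precision label -> ExLlamaV2 cache class name.
# The mapping itself is this sketch's own helper, not part of the library.
CACHE_CLASS_NAMES = {
    "FP16": "ExLlamaV2Cache",
    "Q8": "ExLlamaV2Cache_Q8",
    "Q6": "ExLlamaV2Cache_Q6",
    "Q4": "ExLlamaV2Cache_Q4",
}


def cache_class_name(mode: str) -> str:
    """Return the cache class name for a precision label such as 'Q4'."""
    key = mode.upper()
    if key not in CACHE_CLASS_NAMES:
        raise ValueError(f"unknown cache mode: {mode!r}")
    return CACHE_CLASS_NAMES[key]


def load_with_cache(model_dir: str, mode: str = "Q4", max_seq_len: int = 32768):
    """Hypothetical helper: load a model with the chosen KV-cache precision."""
    # Imported here so the mapping above can be used without the library installed.
    import exllamav2
    from exllamav2 import ExLlamaV2, ExLlamaV2Config

    config = ExLlamaV2Config(model_dir)
    config.max_seq_len = max_seq_len
    model = ExLlamaV2(config)

    # Look up the cache class by name and allocate it lazily so that
    # load_autosplit can place weights and cache together.
    cache_cls = getattr(exllamav2, cache_class_name(mode))
    cache = cache_cls(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache)
    return model, cache
```

Calling `load_with_cache("/path/to/model", mode="Q8")` first and dropping to `"Q4"` only if the target context still does not fit matches the cautious order suggested in this report.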
Journey Context:
At 16k+ context the FP16 KV cache grows to several GB even for GQA models (and can exceed the weights themselves for models without GQA), so cache storage becomes the bottleneck. ExLlamaV2 supports FP16, Q8, Q6, and Q4 cache precisions; its Q4 cache uses Hadamard-rotated keys/values, and official perplexity tests show it is often closer to FP16 than the FP8 cache is. The tradeoff is task- and model-dependent, so validate on your own benchmark: start with Q8 if you are risk-averse and switch to Q4 only when context length is the hard constraint.
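To get a feel for the numbers, here is a rough back-of-the-envelope estimate of KV-cache size at different precisions. The geometry (80 layers, 8 KV heads, head dimension 128) is an assumed Llama-2-70B-style GQA configuration, not something stated in this report, and the nominal bit widths ignore the per-group scale overhead of quantized caches, so real savings will be somewhat smaller.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_element):
    """Nominal KV-cache size in bytes for a single sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8


if __name__ == "__main__":
    # Assumed 70B-class GQA geometry at a 16k context.
    for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        size = kv_cache_bytes(80, 8, 128, 16384, bits)
        print(f"{name}: {size / GIB:.2f} GiB")
```

Under these assumptions the FP16 cache comes to about 5 GiB at 16k tokens and the nominal Q4 cache to about 1.25 GiB, which is the difference between overflowing and fitting next to roughly 20 GB of weights on a 24 GB card.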
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:54:42.063127+00:00— report_created — created