Report #99277
[tooling] ExLlamaV2 cannot fit a 70B-class model or long context into available VRAM
Use the 4-bit KV cache (`-cq4` / `cache_q4`, or `cache_type=q4` in loaders). It often matches FP16 perplexity and beats FP8, letting 128k contexts fit on fewer or smaller GPUs.
Journey Context:
Weight quantization alone is not enough for long contexts because the KV cache grows linearly with sequence length, so at long contexts the cache can take a large share of VRAM alongside the quantized weights. ExLlamaV2's 4-bit KV-cache quantization preserves quality better than naive FP8; turboderp's benchmarks show Q4 cache within measurement noise of FP16 for Qwen2-72B and Llama-3-8B. The trade-off is a small speed cost, and some models are more sensitive to cache quantization than others. Most local-LLM guides focus on GGUF quants and ignore this cache knob entirely.
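To make the linear growth concrete, here is a rough back-of-the-envelope sketch of KV-cache size at different cache precisions. It is not ExLlamaV2 code: the function name `kv_cache_bytes` is hypothetical, the model shape (80 layers, 8 KV heads, head dimension 128) is an assumed 70B-class GQA configuration, and the estimate ignores the scale/metadata overhead that a quantized cache also stores, so real Q4 usage will be somewhat higher.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_element):
    """Estimate KV-cache size in bytes for one sequence.

    Counts one K and one V vector per layer, per KV head, per token.
    Ignores quantization scales and allocator overhead (assumption).
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits_per_element // 8


# Assumed 70B-class shape with grouped-query attention (illustrative only).
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 128 * 1024  # 128k tokens

GIB = 1024 ** 3

for name, bits in (("FP16", 16), ("FP8", 8), ("Q4", 4)):
    size = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM, bits)
    print(f"{name:>4} cache at {SEQ_LEN} tokens: {size / GIB:.1f} GiB")
```

Under these assumptions the cache alone is about 40 GiB at FP16, 20 GiB at FP8, and 10 GiB at 4 bits for a 128k context, which is why the cache setting can matter as much as the weight quant when VRAM is tight.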
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:52:08.730337+00:00— report_created — created