Report #94734
[tooling] Running 70B models with 128k context exceeds 48GB VRAM despite GGUF weight quantization
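Quantize the KV cache by launching llama.cpp with --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0 for extreme cases). Relative to FP16, q8_0 cuts KV memory by roughly 47% and q4_0 by roughly 72%, with minimal perplexity impact at q8_0. llama.cpp requires Flash Attention (--flash-attn) to be enabled when the V cache is quantized.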
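A minimal sketch of assembling that launch command, for readers who script their runs. The helper name `llama_server_args`, the `llama-server` binary name on PATH, and the `model.gguf` path are illustrative assumptions, not part of the report. The Flash Attention argument is passed separately because its syntax differs between llama.cpp versions (newer builds take a value such as `on`, older builds a bare flag), so check `--help` on your build.

```python
import shlex

# Cache types discussed in this report; llama.cpp accepts others as well.
SUPPORTED_CACHE_TYPES = {"f16", "q8_0", "q4_0"}


def llama_server_args(model_path, n_ctx, cache_type="q8_0",
                      flash_attn=("--flash-attn", "on")):
    """Build an argv list for llama-server with a quantized KV cache.

    flash_attn is an assumption about the installed build: pass
    ("--flash-attn",) for older versions where the flag takes no value.
    """
    if cache_type not in SUPPORTED_CACHE_TYPES:
        raise ValueError(f"unsupported cache type: {cache_type}")
    args = [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(n_ctx),
        "--cache-type-k", cache_type,  # quantize cached keys
        "--cache-type-v", cache_type,  # quantize cached values
    ]
    args += list(flash_attn)  # needed for a quantized V cache
    return args


# Print the command for inspection instead of executing it.
print(shlex.join(llama_server_args("model.gguf", 131072)))
```

Printing rather than executing keeps the sketch safe to run anywhere; pass the list to `subprocess.run` once the flags have been checked against the installed version.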
Journey Context:
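Users trying to fit long contexts on workstation and server GPUs (A6000, A100 40GB) often fail because the KV cache grows linearly with context length. For a 70B model with 80 layers, a 128k context costs about 40 GiB in FP16 with Llama-3-style grouped-query attention (8 KV heads, head dimension 128), and several times more without GQA. A common mistake is to reach for IQ2 weight quants, which severely degrade quality. The fix is KV cache quantization, which compresses the cached key/value activations at runtime and leaves the weights untouched. Q8_0 is nearly indistinguishable from FP16 for KV, while Q4_0 saves the most memory. This is orthogonal to weight quants (Q4_K_M), allowing high-quality weights + a compressed cache. Another common mistake: enabling --flash-attn alone and expecting it to solve memory. Flash Attention lowers the working memory of the attention computation but does not quantize the cache; in llama.cpp it is a prerequisite for a quantized V cache, not a substitute for one.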
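The memory arithmetic can be checked with a short calculation. This is a sketch under stated assumptions: the function name `kv_cache_bytes` is hypothetical, and the architecture numbers (80 layers, 8 KV heads, head dimension 128) describe a Llama-3-style 70B with grouped-query attention, so read the real values from your model's metadata. The block sizes follow ggml's layouts: 32 elements per block, stored as 64 bytes in f16, 34 bytes in q8_0, and 18 bytes in q4_0.

```python
BLOCK_ELEMS = 32
# Bytes needed to store one block of 32 cache elements in each format.
BYTES_PER_BLOCK = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in bytes for a single sequence."""
    # One K and one V vector per layer, per KV head, per token.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems // BLOCK_ELEMS * BYTES_PER_BLOCK[cache_type]


GIB = 2**30
# Assumed Llama-3-style 70B shape at a 128k (131072-token) context.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(80, 8, 128, 131072, cache_type)
    print(f"{cache_type}: {size / GIB:.2f} GiB")
```

Under these assumptions the estimate is 40 GiB for f16, 21.25 GiB for q8_0, and 11.25 GiB for q4_0. It covers the cache only; weights and compute buffers come on top.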
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:35:28.427494+00:00— report_created — created