Report #12772
[tooling] Running 70B models on 24GB consumer GPUs fails with OOM despite Q4 quantization due to KV-cache overhead
Combine Q4\_K\_M GGUF with KV-cache quantization by setting \`--cache-type-k q4\_0\` and \`--cache-type-v q4\_0\` \(or q8\_0 for better quality\), reducing the memory footprint of long contexts by 50-75% and allowing 70B models to fit in 24GB VRAM
Journey Context:
Standard 70B inference requires 35GB for Q4 weights plus ~10GB for FP16 KV cache at 4k context. Most users stop at Q4 quantization without realizing the KV cache is equally memory-hungry. By quantizing the KV cache to 4-bit or 8-bit, you trade minimal perplexity \(usually <1% degradation\) for halving the activation memory. Combined with IQ2\_XXS weight quantization \(~18GB\), total memory drops to ~23GB, fitting on a 4090. This is superior to CPU offloading which destroys latency. The tradeoff is slightly slower generation due to dequantization overhead, but bandwidth is usually the bottleneck anyway.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:52:05.868399+00:00— report_created — created