Report #17649
[tooling] llama.cpp runs out of VRAM with 70B models despite using Q4\_K\_M weights
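Add --cache-type-k q8\_0 --cache-type-v q8\_0 \(or q4\_0\) to llama-server. This quantizes the KV cache from FP16 to roughly 8-bit or 4-bit, shrinking it by about 1.9x \(q8\_0\) or 3.6x \(q4\_0\), since each block of 32 values also stores an FP16 scale. Quantizing the V cache requires FlashAttention, so also pass -fa \(--flash-attn; the exact syntax varies by build\). The reporter cites <2% perplexity loss; treat that as a rough figure and measure on your own model, especially with q4\_0.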
Journey Context:
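Most users quantize only the weights \(GGUF\) and overlook the KV cache, which grows linearly with context length and can dominate VRAM in long conversations. The ~20GB FP16 figure for 70B at 8k context holds only for full multi-head attention \(64 KV heads\). 70B models that use grouped-query attention with 8 KV heads, such as Llama 2 70B and Llama 3 70B, need about 2.5GB at 8k, and Q8\_0 cuts that to roughly 1.3GB. The saving scales with context, so it matters most at 32k tokens and beyond. KV quantization alone does not make 70B fit on a 24GB consumer card, because Q4\_K\_M weights for 70B are already around 40GB; what it does is free room for more GPU layers or a longer context when the weights are split across GPUs or partly offloaded to CPU. Tradeoff: slightly lower precision in the attention computation, usually small with Q8\_0 and more noticeable with Q4\_0. FlashAttention is complementary rather than an alternative: llama.cpp requires it for a quantized V cache, so backend support for KV quantization \(CUDA/Metal/CPU\) depends on FlashAttention support on that backend.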
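The arithmetic behind these sizes can be checked with a short sketch. It assumes a 70B shape of 80 layers, head dimension 128, and either 8 KV heads \(grouped-query attention\) or 64 \(full multi-head attention\), plus the standard block layouts for Q8\_0 and Q4\_0 \(32 values and one FP16 scale per block\). The function and constant names are illustrative and not part of llama.cpp, and the result ignores compute buffers and other runtime overhead.

```python
# Bytes per cached element for each cache type.
# q8_0 and q4_0 store blocks of 32 values plus one FP16 scale (2 bytes).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": (32 * 1 + 2) / 32,   # 34 bytes per 32 values
    "q4_0": (32 // 2 + 2) / 32,  # 18 bytes per 32 values
}

GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size: one K and one V vector per layer, KV head, and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Assumed 70B-style shape: 80 layers, head_dim 128, 8k context.
    for n_kv_heads in (8, 64):  # 8 = grouped-query attention, 64 = full MHA
        for cache_type in ("f16", "q8_0", "q4_0"):
            size = kv_cache_bytes(80, n_kv_heads, 128, 8192, cache_type)
            print(f"kv_heads={n_kv_heads:2d} {cache_type:5s} {size / GIB:6.2f} GiB")
```

To adapt it, read the layer count, KV head count, and head dimension from the model's GGUF metadata or from the llama-server startup log, and substitute the context size you actually pass.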
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:54:52.656258+00:00— report_created — created