Report #64432
[tooling] llama.cpp runs out of VRAM with long context despite using Q4\_K\_M quant
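Quantize the KV cache to Q4\_0 or Q5\_0 with the \`-ctk q4\_0 -ctv q4\_0\` flags. Relative to the default fp16 cache, Q4\_0 \(4.5 bits per value\) shrinks the cache by about 72% and Q5\_0 \(5.5 bits per value\) by about 66%, with a reported perplexity impact under 1%. This can make 128k context fit in 24GB VRAM, depending on the model's layer count and number of KV heads. Note that llama.cpp only accepts a quantized V cache when flash attention is enabled \(\`-fa\`\).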
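As a rough illustration of the flags above, the sketch below assembles an argument list in Python. The helper name \`build\_llama\_args\`, the file name \`model.gguf\`, and the \`llama-server\` binary name are assumptions made for the example; the form of the flash attention flag differs between llama.cpp versions, so it is left to the caller via \`extra\_args\`.

```python
def build_llama_args(model_path, ctx_size, cache_type="q4_0", extra_args=()):
    """Return CLI arguments that request a quantized KV cache.

    Hypothetical helper: only assembles strings, it does not run anything.
    """
    return [
        "-m", model_path,        # GGUF model file
        "-c", str(ctx_size),     # context length in tokens
        "-ctk", cache_type,      # K cache type
        "-ctv", cache_type,      # V cache type (needs flash attention enabled)
        *extra_args,             # e.g. the flash attention flag for your build
    ]


# Example: print a command line for a 32k context (binary name is an assumption).
args = build_llama_args("model.gguf", 32768)
print(" ".join(["llama-server", *args]))
```

Check the \`--help\` output of your build before running the printed command, since flag spellings have changed across releases.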
Journey Context:
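Most users quantize only the weights \(GGUF\) and leave the KV cache at its fp16 default, which dominates memory at long context: 2 \* num\_layers \* num\_kv\_heads \* head\_dim \* seq\_len \* 2 bytes, where the leading 2 counts K and V and the trailing 2 is bytes per fp16 value. For models with grouped-query attention, num\_kv\_heads is smaller than the number of attention heads, so using the full head count overestimates the cache. Quantizing the KV cache to Q4\_0 is supported in llama.cpp since commit 3c2a66c. Tradeoff: a slight quality loss \(reported as ~0.05 bits per byte on Llama-2-70B\) in exchange for context lengths that would not otherwise fit. FlashAttention is not a substitute: it reduces attention scratch memory but leaves the KV cache itself the same size. In llama.cpp it is a prerequisite for quantizing the V cache and may not be enabled by default, depending on the build.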
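The cache-size formula can be checked with a short calculation. This is a minimal sketch: the function name \`kv\_cache\_bytes\` is hypothetical, the bytes-per-value figures follow the ggml block layouts \(32 values plus one fp16 scale per block\), and the example shape \(80 layers, 8 KV heads, head dimension 128\) is an assumed Llama-2-70B-like configuration.

```python
# Bytes per cached value. Quantized types store 32 values per block
# together with a 2-byte fp16 scale (q4_0: 18 bytes, q5_0: 22, q8_0: 34).
BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q5_0": 22 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, cache_type="f16"):
    """Estimate KV cache size in bytes for one sequence."""
    # Factor 2 accounts for storing both K and V.
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * BYTES_PER_VALUE[cache_type]


# Assumed Llama-2-70B-like shape; adjust to your model's metadata.
shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)
for seq_len in (4096, 32768, 131072):
    f16 = kv_cache_bytes(seq_len=seq_len, **shape)
    q4 = kv_cache_bytes(seq_len=seq_len, cache_type="q4_0", **shape)
    print(f"{seq_len:>7} tokens: f16 {f16 / 2**30:6.2f} GiB, "
          f"q4_0 {q4 / 2**30:6.2f} GiB ({1 - q4 / f16:.1%} smaller)")
```

The estimate covers the cache only; weights, compute buffers, and any per-backend overhead come on top of it.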
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:38:02.807006+00:00— report_created — created