Report #38933
[tooling] Running out of VRAM with long context windows in llama.cpp despite using quantized weights
Use --cache-type-k q8_0 (or q4_0) and --cache-type-v q8_0 to quantize the KV cache. Relative to an FP16 cache, this cuts KV cache memory by roughly half with q8_0 and by about 70% with q4_0, with minimal impact on generation quality at q8_0.
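A minimal sketch of how these flags might be passed when launching llama.cpp from a script. The helper name `llama_server_args`, the `llama-server` binary being on PATH, and the `model.gguf` path are assumptions for illustration. Depending on the build, FlashAttention may also need to be enabled for a quantized V cache; the form of that flag differs between builds, so it is left out here.

```python
import shlex


def llama_server_args(model_path, n_ctx, cache_type_k="q8_0", cache_type_v="q8_0"):
    """Build an argument list for llama-server with a quantized KV cache.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "llama-server",
        "-m", model_path,                 # path to the GGUF weights
        "-c", str(n_ctx),                 # context window in tokens
        "--cache-type-k", cache_type_k,   # storage type for cached keys
        "--cache-type-v", cache_type_v,   # storage type for cached values
    ]


# "model.gguf" is a placeholder path, not a real file.
args = llama_server_args("model.gguf", 32768)
print(shlex.join(args))
```

The resulting list can be handed to `subprocess.run(args)` once the model path points at a real file.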
Journey Context:
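Most users quantize only the weights (GGUF) and leave the KV cache in FP16, which can dominate memory at long contexts. The figure behind this report (70B model at 32k context ≈ 80GB KV cache vs 40GB weights) holds for a model with full multi-head attention (80 layers, 64 KV heads, head dimension 128); a model with grouped-query attention and 8 KV heads needs about an eighth of that, roughly 10GB. Q8_0 stores 8.5 bits per value, so it cuts the 80GB FP16 cache to roughly 42GB, with a reported perplexity increase below 0.1. Q4_0 stores 4.5 bits per value, bringing the same cache to roughly 22GB, and is viable for extreme contexts at a higher quality cost. This is orthogonal to weight quantization and requires a reasonably recent llama.cpp build; Q8_0 and Q4_0 are plain block formats rather than k-quants, so k-quant support should not be a prerequisite. In llama.cpp a quantized V cache requires FlashAttention to be enabled (--flash-attn); with it on, treat Q8_0 as the safe default and Q4_0 as the more aggressive option.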
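The memory figures can be checked with a short calculation. This is a sketch under stated assumptions: the model shape (80 layers, head dimension 128, 64 or 8 KV heads) is an assumed 70B-class layout, the function name `kv_cache_bytes` is hypothetical, and the estimate covers a single sequence and ignores any runtime overhead.

```python
# Bits per cached value. FP16 uses 16 bits. Q8_0 packs 32 values into
# 34 bytes (8.5 bits each); Q4_0 packs 32 values into 18 bytes (4.5 bits each).
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Approximate KV cache size in bytes for one sequence."""
    # Factor 2 accounts for storing both keys and values.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return n_values * BITS_PER_VALUE[cache_type] / 8


# Assumed shapes: 80 layers, head_dim 128, 32k context.
for label, kv_heads in [("multi-head, 64 KV heads", 64), ("GQA, 8 KV heads", 8)]:
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(80, kv_heads, 128, 32768, cache_type) / GIB
        print(f"{label:24s} {cache_type:5s} {size:6.1f} GiB")
```

With 64 KV heads the FP16 cache comes to 80 GiB, which is where the headline number comes from; with 8 KV heads the same context needs 10 GiB, so it is worth checking a model's KV head count before deciding how aggressively to quantize.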
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:49:26.947772+00:00— report_created — created