Report #804
[tooling] llama.cpp running out of VRAM with long contexts on 70B+ models
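Quantize the KV cache by passing `--cache-type-k q4_0 --cache-type-v q4_0` (or `q8_0` for both if you see quality loss). The llama.cpp server accepts the same flags. Relative to the default f16 cache, `q4_0` cuts KV memory by roughly 72% (4.5 bits per value instead of 16) and `q8_0` by roughly 47%, typically with minimal perplexity impact. That can bring 128k context within reach on 48GB setups, depending on how the weights are quantized. Enable `--flash-attn` as well: llama.cpp requires flash attention for a quantized V cache, and it also speeds up long contexts.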
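As a rough illustration of how these flags fit together, here is a minimal Python sketch that only assembles a command line and prints it. The binary name `llama-server`, the model path, and the helper `build_server_args` are assumptions for illustration. Some newer builds expect a value after `--flash-attn` (for example `on`), so the helper leaves that optional. Check `--help` on your build before running anything.

```python
import shlex


def build_server_args(model_path, n_ctx, cache_type="q4_0", flash_attn_value=None):
    """Assemble an argv list for a llama.cpp server launch (hypothetical helper)."""
    args = [
        "llama-server",                 # assumed binary name; older builds may differ
        "--model", model_path,
        "--ctx-size", str(n_ctx),
        "--cache-type-k", cache_type,   # quantization type for the K cache
        "--cache-type-v", cache_type,   # quantization type for the V cache
        "--flash-attn",                 # needed for a quantized V cache
    ]
    # Some builds take an explicit value after --flash-attn; pass it only if given.
    if flash_attn_value is not None:
        args.append(flash_attn_value)
    return args


# The model path below is a placeholder, not a real file.
print(shlex.join(build_server_args("models/example-70b.gguf", 131072)))
```

Switching to the safer setting is a one-argument change: `build_server_args(path, n_ctx, cache_type="q8_0")`.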
Journey Context:
At long context the KV cache dominates memory and can exceed the weights themselves. Many users leave the cache at the f16 default and then fail to fit 128k on consumer GPUs. llama.cpp supports separate quantization types for the K and V caches; Q4_0 is surprisingly good because KV errors do not accumulate across layers the way weight quants do. Q8_0 is the safer default if you observe degradation on code or math. The mistake is assuming all quantized caches are low quality: K/V quantization is one of the highest-ROI memory wins in local inference.
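The size of the win can be sanity-checked with a little arithmetic. The sketch below estimates KV cache size for each cache type. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption for a Llama-style 70B model with grouped-query attention, not something read from a model file, and the estimate ignores compute buffers and other overhead.

```python
# Storage cost per cache type as (bytes per block, values per block).
BLOCK_LAYOUT = {
    "f16": (2, 1),     # one 2-byte float per value
    "q8_0": (34, 32),  # 32 int8 values plus one fp16 scale
    "q4_0": (18, 32),  # 32 4-bit values plus one fp16 scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, type_k="f16", type_v="f16"):
    """Estimate total bytes for the K and V caches at full context."""
    values_per_tensor = n_ctx * n_layers * n_kv_heads * head_dim
    total = 0
    for cache_type in (type_k, type_v):
        block_bytes, block_values = BLOCK_LAYOUT[cache_type]
        total += values_per_tensor * block_bytes // block_values
    return total


# Assumed shape for a Llama-style 70B model with grouped-query attention.
SHAPE = dict(n_ctx=131072, n_layers=80, n_kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(**SHAPE)
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**SHAPE, type_k=cache_type, type_v=cache_type)
    saved = 100 * (1 - size / baseline)
    print(f"{cache_type}: {size / 2**30:.2f} GiB ({saved:.0f}% smaller than f16)")
```

Under these assumptions the f16 cache alone is about 40 GiB at 128k context, which is why it can rival the quantized weights, while Q4_0 brings it down to about 11 GiB.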
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:51:37.102737+00:00— report_created — created