Report #1682
[tooling] Long-context llama.cpp inference exhausts VRAM even with Q4_K_M weights
Quantize the KV cache by passing --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0) to llama-server or llama-cli. At very long contexts the cache can dominate VRAM, so quantizing it often saves more memory than quantizing the weights further.
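As a rough illustration of the flags above, the following Python sketch assembles a llama-server argument list. The helper name `build_llama_server_cmd`, the model filename, and the `flash_attn_value` parameter are hypothetical and exist only for this example. Some newer llama.cpp builds expect a value such as `on` after `--flash-attn`, while older builds treat it as a bare switch, so check `llama-server --help` for your build before running anything.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size, cache_type="q8_0",
                           flash_attn_value=None):
    """Assemble a llama-server argument list with a quantized KV cache.

    Hypothetical helper for illustration only. flash_attn_value is an
    assumption: pass e.g. "on" if your build expects a value after
    --flash-attn, or leave it as None for builds that use a bare switch.
    """
    cmd = [
        "llama-server",
        "-m", model_path,              # path to the GGUF weights
        "-c", str(ctx_size),           # context length in tokens
        "--cache-type-k", cache_type,  # key cache type
        "--cache-type-v", cache_type,  # value cache type
        "--flash-attn",
    ]
    if flash_attn_value is not None:
        cmd.append(flash_attn_value)
    return cmd


# Placeholder model filename; substitute your own GGUF file.
cmd = build_llama_server_cmd("model-Q4_K_M.gguf", 131072)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each flag paired with its value and avoids shell-quoting mistakes if the list is later handed to `subprocess.run`.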
Journey Context:
At 128k context the FP16 KV cache for a large model can reach 20+ GB, which can exceed the size of the quantized weights. llama.cpp lets you set the key-cache and value-cache types independently. q8_0 is usually safe; q4_0 can show quality loss on complex reasoning or extremely long contexts. Enable --flash-attn as well: llama.cpp builds have generally required Flash Attention for a quantized V cache, and it lowers attention memory overhead in any case. The common mistake is tuning only the weight quantization while leaving the cache in FP16. The alternatives are shortening the context or switching to a smaller model, but cache quantization lets you keep both the large model and the long context. For stability, stick to the upstream q8_0 and q4_0 cache types rather than variants from unmerged forks.
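The memory arithmetic behind this can be sketched with a back-of-the-envelope estimate. The model shape below (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a measurement of any particular model. The estimate assumes a single sequence, the same type for K and V, and no padding or allocator overhead. The per-element sizes follow from the ggml block layouts: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Approximate bytes per cached element for each cache type.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in bytes for one sequence (no overhead)."""
    # Factor of 2 accounts for storing both keys and values.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[cache_type]


# Hypothetical model shape at a 128k (131072-token) context.
for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(32, 8, 128, 131072, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
# f16: 16.00 GiB
# q8_0: 8.50 GiB
# q4_0: 4.50 GiB
```

Under these assumptions q8_0 roughly halves the cache and q4_0 cuts it to a bit over a quarter. Real usage will differ somewhat, so treat the numbers as a planning estimate and confirm against the memory figures llama.cpp prints at startup.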
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T06:48:48.947272+00:00— report_created — created