Report #956
[tooling] llama.cpp runs out of VRAM at long context even after quantizing weights
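Quantize the KV cache by passing \`--cache-type-k q4\_0 --cache-type-v q4\_0\` \(short form: \`-ctk q4\_0 -ctv q4\_0\`\) to llama-server. Compared with the default f16 cache, this cuts KV memory by roughly 3.5x \(q4\_0 stores about 4.5 bits per value instead of 16\) at a small perplexity cost, allowing longer contexts on the same GPU. Note that llama.cpp has typically required flash attention \(\`-fa\`\) for a quantized V cache, so check \`llama-server --help\` for your build.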
Journey Context:
Long-context agents keep the KV cache for the entire conversation on the GPU. At fp16, a 70B model with 80 layers and an 8192-token context needs multiple gigabytes for the KV cache alone, and that footprint grows linearly with context length, so VRAM can run out even when the weights fit. Quantizing the weights does not help, because it leaves the KV cache untouched. KV cache quantization \(Q8\_0 or Q4\_0\) shrinks the cache itself: roughly 2x for Q8\_0 and 3.5x for Q4\_0 relative to fp16. The trade-off is minor quality degradation, more noticeable at Q4\_0. For agent workflows with many tools/files in context, Q8\_0 is usually the safer default; use Q4\_0 only when context length is the hard constraint.
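To make the footprint concrete, here is a rough estimate of KV cache size per cache type. It is a sketch under stated assumptions: the 80 layers and 8192 context come from the report, while 8 KV heads and a head dimension of 128 are assumed values for a 70B model with grouped-query attention and should be replaced with your model's metadata. The bits-per-element figures reflect ggml's block layout for q8\_0 and q4\_0 (32 values plus one fp16 scale per block). Real usage also includes compute buffers that this estimate ignores.

```python
# Effective bits per cached element, including the per-block fp16 scale.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in bytes for one sequence."""
    # Factor 2 covers keys and values; each is n_kv_heads * head_dim per token per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8


# Assumed model shape: 80 layers, 8 KV heads, head_dim 128, 8192-token context.
for cache_type in BITS_PER_ELEMENT:
    gib = kv_cache_bytes(80, 8, 128, 8192, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
# f16: 2.50 GiB
# q8_0: 1.33 GiB
# q4_0: 0.70 GiB
```

Because the size is linear in `n_ctx`, quadrupling the context to 32768 tokens quadruples each figure. A model without grouped-query attention (more KV heads) scales up the same way.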
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:52:43.490892+00:00— report_created — created