Report #79040
[tooling] llama.cpp runs out of VRAM or system RAM with long context windows despite using a small GGUF model
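Enable a quantized KV cache with `--cache-type-k q8_0 --cache-type-v q8_0` (or `q4_0` for extreme cases). Relative to the default FP16 cache, this cuts KV cache memory by roughly 47% (`q8_0`) or 72% (`q4_0`) with minimal perplexity impact, which can make 128k context feasible on 24GB cards for small models. On many builds a quantized V cache also requires flash attention to be enabled (`-fa` / `--flash-attn`); check `--help` for your version.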
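As a minimal sketch of how the flags above fit into a full invocation, the Python snippet below assembles a command line. The binary name `llama-server`, the model path `model.gguf`, and the context size are placeholders chosen for illustration, not values from this report; only the two `--cache-type-*` flags come from the workaround itself.

```python
import shlex


def kv_cache_args(cache_type: str = "q8_0") -> list[str]:
    """Return the KV cache flags from this report for one cache type."""
    allowed = {"f16", "q8_0", "q4_0"}  # types discussed in this report
    if cache_type not in allowed:
        raise ValueError(f"unsupported cache type: {cache_type}")
    return ["--cache-type-k", cache_type, "--cache-type-v", cache_type]


# Hypothetical binary name, model path, and context size (assumptions).
base = ["llama-server", "-m", "model.gguf", "-c", "131072"]
command = base + kv_cache_args("q8_0")

# Print a shell-safe command line to review before running it.
print(shlex.join(command))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; flash attention is left out because the flag's syntax differs between llama.cpp versions.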
Journey Context:
Users often fixate on model size as the memory bottleneck, but the KV cache scales linearly with sequence length and can dominate memory usage. For a 70B model with 128k context, the FP16 KV cache alone exceeds 30GB. Quantizing the KV cache to Q8_0 (about 1 byte per element) or Q4_0 (about 0.5 bytes), each with a small per-block scale overhead, was recently stabilized and reportedly keeps perplexity within about 1% of the FP16 baseline. The tradeoff is a small latency increase from dequantization overhead, which is usually preferable to OOM crashes or being unable to use long contexts. KV cache quantization is separate from weight quantization: a quantized GGUF model still uses an FP16 cache unless the CLI flags are passed explicitly.
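The arithmetic behind these numbers can be sketched in a few lines of Python. The model shape used here (80 layers, 8 KV heads, head dimension 128) is an assumption typical of 70B models with grouped-query attention, not a value from this report. The bytes-per-element figures follow the GGML block layouts: 34 bytes per 32 elements for Q8_0 and 18 bytes per 32 elements for Q4_0.

```python
# Bytes per cached element, including the per-block FP16 scale
# for the quantized types (blocks of 32 elements).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + 2-byte scale
    "q4_0": 18 / 32,  # 32 4-bit values + 2-byte scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size: one K and one V vector per layer per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


# Assumed 70B-class shape with grouped-query attention, 128k context.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(80, 8, 128, 131072, cache_type)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

Under these assumptions the estimate is 40 GiB for FP16, 21.25 GiB for Q8_0, and 11.25 GiB for Q4_0. Actual usage depends on the model architecture and on compute buffers that this estimate ignores.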
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:16:02.608508+00:00— report_created — created