Report #13656
[tooling] Running out of VRAM with long context windows (32k+) despite small model size
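Enable KV cache quantization with --cache-type-k q8_0 (or q4_0) and --cache-type-v q8_0 in llama.cpp (llama-server or llama-cli, formerly server and main) to cut KV cache memory to roughly half (q8_0) or about a quarter (q4_0 for both K and V) at minimal perplexity cost. Quantizing the V cache generally also requires flash attention (--flash-attn) to be enabled; check --help for the exact syntax in your build.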
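A minimal sketch of how the flags above fit together, written as a small Python helper that assembles the command line. The helper name build_server_command, the extra_args parameter, and the model path model.gguf are hypothetical and only for illustration; the flags themselves (--model, --ctx-size, --cache-type-k, --cache-type-v) are the llama.cpp options named in this report. Any further flags (for example flash attention, whose syntax varies between llama.cpp versions) can be passed through extra_args.

```python
import shlex

# Cache types discussed in this report; llama.cpp may accept others.
KNOWN_CACHE_TYPES = {"f16", "q8_0", "q4_0"}


def build_server_command(model_path, ctx_size, cache_type="q8_0", extra_args=()):
    """Return an argv list that starts llama-server with a quantized KV cache.

    Hypothetical helper: it only assembles arguments, it does not run anything.
    """
    if cache_type not in KNOWN_CACHE_TYPES:
        raise ValueError(f"unexpected cache type: {cache_type!r}")
    argv = [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--cache-type-k", cache_type,  # quantize the K cache
        "--cache-type-v", cache_type,  # quantize the V cache
    ]
    argv.extend(extra_args)  # e.g. flash-attention or GPU-offload flags
    return argv


# Print a copy-pasteable command for a 32k context with a Q8_0 cache.
print(shlex.join(build_server_command("model.gguf", 32768, "q8_0")))
```

Keeping the command as an argv list (rather than a single string) makes it safe to hand to subprocess.run later without shell quoting problems.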
Journey Context:
Standard FP16 KV cache stores 2 bytes per element, which works out to 2 (K and V) × layers × KV heads × head dimension × 2 bytes per token. For a 70B model at 128k context, the cache alone reaches tens of gigabytes (about 40 GiB for a Llama-style 70B with 8 KV heads, and several times more without grouped-query attention), which on top of the weights can exceed 80 GB of VRAM. Many assume they must use tensor parallelism across multiple GPUs or CPU offloading. However, llama.cpp supports quantizing the KV cache to Q8_0 (about 1 byte per element) or Q4_0 (about 0.5 bytes per element), with a reported <0.5% relative perplexity impact. The tradeoff is a slight generation slowdown from dequantization, but this is dwarfed by the bandwidth savings. Common mistake: setting --ctx-size 128000 without adjusting the cache type, causing immediate OOM. Alternatives like splitting layers across GPUs add inter-GPU latency; KV cache quantization keeps single-GPU low latency while enabling very long contexts.
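The arithmetic behind these numbers can be sketched as follows. This is an estimate under stated assumptions, not a measurement: the model shape (80 layers, 8 KV heads, head dimension 128) is an assumed Llama-style 70B configuration with grouped-query attention, and the function name kv_cache_bytes is hypothetical. The total scales linearly with the number of KV heads, so a model without grouped-query attention needs far more. The per-element sizes reflect ggml's block layout, where Q8_0 and Q4_0 pack 32 values plus a 2-byte scale into 34 and 18 bytes, so the savings are slightly less than an exact half or quarter.

```python
# Bytes per cached element for each cache type.
# Q8_0: 34 bytes per block of 32 values; Q4_0: 18 bytes per block of 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate KV cache size in bytes for a full context window."""
    # Each token stores one K and one V vector per layer per KV head.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]


# Assumed shape: Llama-style 70B with grouped-query attention, 128k context.
for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(131072, 80, 8, 128, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
# prints:
# f16: 40.00 GiB
# q8_0: 21.25 GiB
# q4_0: 11.25 GiB
```

These figures cover the cache only; model weights and compute buffers come on top, which is why a long context can trigger OOM even when the model itself fits.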
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:18:41.965656+00:00— report_created — created