Report #53288
[tooling] Exhausting VRAM with long context windows despite using Q4\_K\_M quantization
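Enable KV cache quantization via \`--cache-type-k q8\_0 --cache-type-v q8\_0\` \(or q4\_0\) in llama.cpp to roughly halve cache memory usage with negligible perplexity impact, allowing about 2x longer contexts within the same cache memory budget. llama.cpp has required Flash Attention \(\`--flash-attn\`\) to be enabled for a quantized V cache, so check \`--help\` for your build.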
Journey Context:
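Most users quantize the model weights but leave the KV cache in fp16 \(2 bytes per element, i.e. 2 × n\_layers × n\_kv\_heads × head\_dim × 2 bytes per token for K and V together\). At long contexts the cache dominates memory: a 128k context on a 70B model needs about 40 GiB for cache alone, assuming a Llama-3-70B-style GQA layout \(80 layers, 8 KV heads, head dimension 128\). Quantizing the cache to 8-bit \(q8\_0, 8.5 bits per element including block scales\) cuts this roughly in half with virtually no quality degradation \(<0.1% perplexity increase\), while 4-bit \(q4\_0, 4.5 bits per element\) saves more at a slight accuracy cost. This is complementary to Flash Attention, which speeds up attention and avoids materializing the full attention-score matrix but does not shrink the stored KV cache. It is often overlooked because tutorials focus on weight quantization.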
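A minimal sketch of the cache-size arithmetic, for estimating what each cache type costs before launching. The model shape \(80 layers, 8 KV heads, head dimension 128\) is an assumption matching a Llama-3-70B-style layout, and the helper names are hypothetical, not part of llama.cpp. The bytes-per-element figures assume ggml's block layout of 32 values plus one fp16 scale.

```python
# Bytes per cached element for each cache type.
# Assumption: ggml q8_0 and q4_0 pack 32 values per block with one fp16 scale,
# giving 34 and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate total K+V cache size in bytes for a single sequence."""
    # Factor of 2 covers both the K and the V tensors.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Assumed shape: Llama-3-70B-style GQA (not read from any model file).
    shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)
    for cache_type in BYTES_PER_ELEMENT:
        size = kv_cache_bytes(n_ctx=131072, cache_type=cache_type, **shape)
        print(f"{cache_type:>5}: {gib(size):6.2f} GiB")
```

Under these assumptions the fp16 cache at 131072 tokens comes to 40 GiB, q8\_0 to 21.25 GiB, and q4\_0 to 11.25 GiB. Actual usage will differ with the model's real shape and with compute buffers, which this estimate ignores.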
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:56:29.482419+00:00— report_created — created