Report #12209
[tooling] Running out of VRAM/RAM with long context windows in llama.cpp despite using quantized models
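Add `--cache-type-k q8_0 --cache-type-v q8_0` (or `q4_0` for extreme cases) to quantize the KV cache itself. Relative to the default f16 cache, this cuts KV cache memory (not total memory) by roughly 47% with `q8_0` and roughly 72% with `q4_0`, with minimal perplexity impact. A quantized V cache generally also requires flash attention to be enabled (`-fa` / `--flash-attn`); check `--help` for your build.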
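As an illustration of where the flags go, here is a minimal sketch that assembles a full invocation. The binary name `llama-server`, the model path `model.gguf`, and the helper `build_llama_args` are assumptions for the example, not part of the original report; `-m` and `-c` are the usual model and context-size options.

```python
import shlex


def build_llama_args(model_path, n_ctx, cache_type="q8_0", binary="llama-server"):
    """Return an argument list that applies the same cache type to K and V.

    `binary` and `model_path` are placeholders; substitute your own.
    """
    return [
        binary,
        "-m", model_path,                # model file (hypothetical path)
        "-c", str(n_ctx),                # context size in tokens
        "--cache-type-k", cache_type,    # quantize the K cache
        "--cache-type-v", cache_type,    # quantize the V cache
    ]


args = build_llama_args("model.gguf", 131072)
print(shlex.join(args))
```

Passing `cache_type="q4_0"` switches both flags to the more aggressive setting.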
Journey Context:
Users typically try to quantize the model weights further (e.g., to Q2_K), which badly degrades quality, or they slash the context size. The KV cache for 128k context with large models consumes massive memory: batch_size * context * hidden_size * layers * 2 [K+V] * 2 bytes [fp16]. (For models that use grouped-query attention, replace hidden_size with n_kv_heads * head_dim, which makes the cache smaller than the formula suggests.) Quantizing this cache to 8 or 4 bits is the better tradeoff: Q8_0 is nearly lossless and Q4_0 is usually acceptable. Many users don't know these flags exist or fear degradation, but this is often the only practical way to run 128k context on 24GB cards.
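The formula above can be turned into a rough estimator. This is a sketch under stated assumptions: the model shape (32 layers, 8 KV heads of dimension 128) is hypothetical and chosen only for illustration, and the helper name `kv_cache_bytes` is not from llama.cpp. The bytes-per-element figures follow from the block layouts of `q8_0` and `q4_0`, which store 32 values plus one fp16 scale per block (34 and 18 bytes respectively).

```python
# Average bytes per cached element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + 2-byte scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + 2-byte scale per block
}


def kv_cache_bytes(n_ctx, n_layers, kv_dim, cache_type="f16", n_seqs=1):
    """Estimate KV cache size in bytes.

    kv_dim is hidden_size for models without grouped-query attention,
    otherwise n_kv_heads * head_dim. The factor 2 covers K and V.
    """
    return n_seqs * n_ctx * kv_dim * n_layers * 2 * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape: 32 layers, 8 KV heads x 128 dims, 128k context.
for cache_type in BYTES_PER_ELEMENT:
    size = kv_cache_bytes(131072, 32, 8 * 128, cache_type)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

For this assumed shape the estimate comes to 16.00 GiB at f16, 8.50 GiB at q8_0, and 4.50 GiB at q4_0, which is the difference between not fitting and fitting alongside the weights on a 24GB card.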
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:19:38.789705+00:00— report_created — created