Report #7301
[tooling] Running out of VRAM/RAM with long context lengths despite using quantized weights
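Quantize the KV cache itself with --cache-type-k q4_0 and --cache-type-v q4_0 (or q8_0) instead of quantizing the model weights further. Relative to the default F16 cache, q4_0 cuts KV cache memory by roughly 72% (close to the commonly quoted 75%) and q8_0 by roughly 47%, because each block of 32 values also stores a 16-bit scale. Perplexity impact is usually small, so the same cache budget holds about 3.5x the context with q4_0 without giving up weight precision.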
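As a minimal sketch of the invocation, the snippet below assembles a llama.cpp command line that passes both cache-type flags. The binary name `llama-server`, the model path, and the context size are placeholders chosen for illustration, and the helper `build_llama_command` is hypothetical. The set of accepted cache types is limited here to the ones this report discusses. Flash attention may need to be enabled separately depending on the build; that option is left out because its syntax varies between versions.

```python
import shlex

# Cache types discussed in this report; llama.cpp may accept others.
REPORT_CACHE_TYPES = {"f16", "q8_0", "q4_0"}

def build_llama_command(model_path, n_ctx, cache_type="q8_0", binary="llama-server"):
    """Return an argv list requesting a quantized KV cache.

    binary, model_path and n_ctx are illustrative placeholders,
    not values taken from the report.
    """
    if cache_type not in REPORT_CACHE_TYPES:
        raise ValueError(f"unsupported cache type in this sketch: {cache_type}")
    return [
        binary,
        "-m", model_path,               # model file
        "-c", str(n_ctx),               # context length in tokens
        "--cache-type-k", cache_type,   # quantization for cached keys
        "--cache-type-v", cache_type,   # quantization for cached values
    ]

argv = build_llama_command("model.gguf", 32768, cache_type="q4_0")
print(shlex.join(argv))
```

Starting with q8_0 for both flags and dropping to q4_0 only if the context still does not fit keeps the quality risk as low as possible.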
Journey Context:
Most users try to fit longer contexts by quantizing the model weights more aggressively (e.g., Q4_K_M to Q3_K_S), which can noticeably degrade model quality. At long context lengths (e.g., 32k+), the KV cache (the keys and values stored for every token) can consume more memory than the weights themselves, depending on the model's layer count and number of KV heads. Quantizing the cache to Q8_0 or Q4_0 targets that cost directly. Q8_0 is nearly indistinguishable from F16 and saves about 47% of cache memory; Q4_0 saves about 72%, at a somewhat higher quality cost. llama.cpp supports this through the --cache-type-k and --cache-type-v flags; quantizing the V cache generally also requires flash attention to be enabled. The tradeoff is slightly slower generation due to dequantization overhead during attention, but the memory savings enable contexts that would otherwise not fit (e.g., 128k context on 48GB VRAM, depending on the model).
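The memory argument can be checked with a rough estimate. The sketch below assumes a standard transformer in which every layer caches one key and one value vector per KV head per token, and it ignores compute buffers and any sliding-window or hybrid attention layers. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a measurement from this report. The bits-per-element figures follow from the block layouts of q8_0 and q4_0: 32 values plus one 16-bit scale per block.

```python
# Bits stored per cached element, including the per-block scale:
#   q8_0: (32 * 8 + 16) / 32 = 8.5 bits
#   q4_0: (32 * 4 + 16) / 32 = 4.5 bits
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in bytes for a given context length."""
    # Factor 2 accounts for keys and values.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(n_ctx=32768, cache_type="f16", **shape)
for cache_type in BITS_PER_ELEMENT:
    size = kv_cache_bytes(n_ctx=32768, cache_type=cache_type, **shape)
    saved = 100 * (1 - size / baseline)
    print(f"{cache_type:>5}: {size / 2**30:.3f} GiB ({saved:.0f}% saved vs f16)")
```

For this example shape at 32768 tokens, the estimate is 4 GiB for f16, 2.125 GiB for q8_0, and 1.125 GiB for q4_0. Plugging in the real layer count, KV head count, and head dimension of the model in use shows whether the target context fits alongside the weights.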
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:18:26.354799+00:00— report_created — created