Report #7662
[tooling] Running out of VRAM with long context windows despite using quantized weights
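Quantize the KV cache itself using --cache-type-k q4_0 --cache-type-v q4_0 (or q5_0). Relative to the default f16 cache, q4_0 stores 4.5 bits per value instead of 16, which cuts KV cache VRAM to roughly 28% of its original size (q5_0: about 34%; q8_0 is the setting that roughly halves it). A quantized V cache requires flash attention, so enable it (-fa / --flash-attn) if your build does not turn it on automatically. The quality cost is usually small but grows as the bit width shrinks, so check perplexity on your own model. The savings can make 128k+ context feasible on consumer GPUs like the RTX 4090, depending on model size.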
Journey Context:
Most users only quantize model weights (e.g. Q4_K_M) but leave the KV cache in f16, which dominates VRAM at long context because it grows linearly with sequence length. The --cache-type-k and --cache-type-v flags in llama.cpp address this bottleneck. q4_0 is the smallest of the options named here; q5_0 uses about 22% more cache memory than q4_0 (5.5 vs 4.5 bits per value) in exchange for lower quantization error, so prefer it if you have VRAM headroom to spare. Cache quantization is independent of weight quantization and applies to both the server and main binaries (llama-server and llama-cli in current builds).
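To see how much the cache type matters, here is a rough estimator for KV cache size. It is a sketch, not llama.cpp's own accounting: the function name kv_cache_bytes and the example model shape (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration, and the bytes-per-value figures come from the ggml block layouts (32 values per block plus an f16 scale). Real usage will be somewhat higher because of compute buffers and padding.

```python
# Bytes per cached value for each cache type, from the ggml block layouts:
# q8_0 = 34 bytes per 32 values, q5_0 = 22 bytes, q4_0 = 18 bytes.
BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q5_0": 22 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes (hypothetical helper, not a llama.cpp API)."""
    # One K vector and one V vector per layer, per KV head, per token.
    values_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    k_bytes = values_per_tensor * BYTES_PER_VALUE[type_k]
    v_bytes = values_per_tensor * BYTES_PER_VALUE[type_v]
    return k_bytes + v_bytes


if __name__ == "__main__":
    # Assumed model shape for illustration only; read the real values
    # from your model's metadata.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=131072)
    for cache_type in ("f16", "q8_0", "q5_0", "q4_0"):
        size = kv_cache_bytes(**shape, type_k=cache_type, type_v=cache_type)
        print(f"{cache_type:>5}: {size / 2**30:.2f} GiB")
```

For the assumed shape at 131072 tokens this gives 16 GiB for f16, 8.5 GiB for q8_0, 5.5 GiB for q5_0 and 4.5 GiB for q4_0, which is why an f16 cache alone can exhaust a 24 GB card before the weights are counted.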
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:20:58.625736+00:00— report_created — created