Report #62812
[tooling] Running out of VRAM with large context windows despite using quantized weights
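Add --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0) to your llama.cpp command line (llama-server or llama-cli, formerly server and main) to quantize the KV cache. Relative to an FP16 cache this cuts KV cache memory by roughly 47% (q8_0) to 72% (q4_0), usually with little perplexity impact.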
Journey Context:
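Most users quantize the weights (GGUF) but forget that the KV cache grows linearly with context length and with the number of sequences processed in parallel. An unquantized FP16 KV cache for a 32k context on a 70B model takes roughly 10 GiB per sequence with grouped-query attention (for example Llama 2 70B: 80 layers, 8 KV heads, head size 128), and can exceed 40 GB with several parallel sequences or on models without grouped-query attention. Quantizing the KV cache to Q8_0 or Q4_0 cuts this substantially: Q8_0 is nearly lossless, while Q4_0 trades some quality for larger savings. This is separate from weight quantization and requires a recent llama.cpp build whose backend (CUDA, Metal) implements the corresponding kernels. On many builds, quantizing the V cache also requires flash attention to be enabled (--flash-attn); check the --help output of your build.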
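The cache sizes involved can be estimated with a short calculation. The following is a minimal sketch, not llama.cpp's own accounting: the function name kv_cache_bytes and its parameters are hypothetical, the model shape (80 layers, 8 KV heads, head size 128) is an assumed grouped-query 70B configuration, and n_seqs assumes each parallel sequence holds a full n_ctx tokens. The per-element sizes follow ggml's block layouts for q8_0 and q4_0 (32 values plus one 2-byte scale per block).

```python
# Approximate KV cache size for a decoder-only transformer.
# Assumption: ggml block layouts, 32 values per block plus one FP16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values (32 bytes) + 2-byte scale
    "q4_0": 18 / 32,  # 32 4-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   cache_type="f16", n_seqs=1):
    """Estimate the bytes used by the K and V caches across all layers.

    Hypothetical helper for illustration; ignores allocator padding and
    any backend-specific overhead.
    """
    # Factor 2 accounts for storing both K and V.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx * n_seqs
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Assumed shape: 70B-class model with grouped-query attention, 32k context.
    shape = dict(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=32768)
    baseline = kv_cache_bytes(**shape)
    for cache_type in BYTES_PER_ELEMENT:
        size = kv_cache_bytes(cache_type=cache_type, **shape)
        saving = 1 - size / baseline
        print(f"{cache_type:5s} {size / 2**30:.1f} GiB, "
              f"{saving:.0%} smaller than f16")
```

Under these assumptions a single 32k sequence needs about 10 GiB at FP16, about 5.3 GiB at q8_0 and about 2.8 GiB at q4_0. Setting n_kv_heads=64 (no grouped-query attention) or raising n_seqs shows how the cache can grow past 40 GB.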
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:54:41.978704+00:00— report_created — created