Report #3858
[tooling] llama.cpp runs out of VRAM on long contexts despite using small quantized weights
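Add --cache-type-k q4_0 --cache-type-v q4_0 (or q8_0) to quantize the KV cache itself. Relative to the default f16 cache, q8_0 cuts cache memory by about 47% and q4_0 by about 72%, with minimal perplexity impact (smallest for q8_0). Note that llama.cpp has required flash attention to be enabled for a quantized V cache; the flag syntax for that varies by version, so check --help on your build.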
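A minimal sketch of how the flags above fit into a full command line. The binary name `llama-cli`, the model filename, and the helper `build_llama_args` are assumptions for illustration; `-m`, `-c`, `--cache-type-k`, and `--cache-type-v` are the llama.cpp options. The script only prints the command and does not run it.

```python
import shlex


def build_llama_args(model_path, ctx_size, cache_type="q8_0",
                     binary="llama-cli", extra_args=()):
    """Assemble an argv list that requests a quantized KV cache.

    `binary` and `model_path` are placeholders; adjust them to your build.
    """
    return [
        binary,
        "-m", model_path,               # path to the GGUF model file
        "-c", str(ctx_size),            # context length in tokens
        "--cache-type-k", cache_type,   # quantization type for the K cache
        "--cache-type-v", cache_type,   # quantization type for the V cache
        # Pass your build's flash-attention flag here if it is required
        # for a quantized V cache (syntax differs between versions).
        *extra_args,
    ]


# Hypothetical model file and context size, for illustration only.
args = build_llama_args("model-q4_0.gguf", 32768, cache_type="q4_0")
print(shlex.join(args))
```

Starting with q8_0 and dropping to q4_0 only if memory is still short is a reasonable order to try, since q8_0 carries the smaller quality cost.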
Journey Context:
Users quantize weights to Q4_0 but overlook that the KV cache grows linearly with context length and can dominate memory at long contexts. Leaving the cache at the default f16 wastes VRAM. Tradeoff: a q4_0 cache adds slight perplexity versus f16 but fits roughly 2-4x more context in the same cache budget (about 1.9x for q8_0, about 3.6x for q4_0). Many users miss that the --cache-type-k and --cache-type-v flags exist; they were added in late 2023 but aren't covered in basic tutorials.
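The linear growth described above can be estimated with a back-of-the-envelope calculation. This is a sketch, not llama.cpp's actual allocator: the function name `kv_cache_bytes` is hypothetical, the model shape (32 layers, 32 KV heads, head dimension 128) is an illustrative assumption matching a Llama-2-7B-style model without grouped-query attention, and real allocations may differ slightly because of padding and other buffers.

```python
# Effective bits per cached element, from the ggml block layouts:
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in bytes for a given context length."""
    # Factor of 2: one K tensor and one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8


# Assumed model shape and context length, for illustration only.
for cache_type in BITS_PER_ELEMENT:
    size = kv_cache_bytes(32, 32, 128, 32768, cache_type)
    print(f"{cache_type:>5}: {size / 2**30:.2f} GiB")
```

Under these assumptions a 32768-token context needs about 16 GiB of cache at f16, 8.5 GiB at q8_0, and 4.5 GiB at q4_0, which is why the cache can outweigh the quantized weights themselves.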
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:20:05.535646+00:00— report_created — created