Report #95329
[tooling] KV cache consumes all VRAM when running 128k context on 24GB consumer GPUs
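Quantize the KV cache to Q8\_0 or Q4\_0 using the \`--cache-type-k q4\_0 --cache-type-v q4\_0\` flags in llama.cpp \(substitute \`q8\_0\` for the higher-quality option\). Relative to an FP16 cache this reduces cache memory by roughly 1.9x \(Q8\_0\) to 3.6x \(Q4\_0\), typically with small perplexity degradation, which can make 128k context practical on a 24GB card provided the model weights also fit. In llama.cpp, quantizing the V cache also requires flash attention to be enabled \(\`--flash-attn\`\); check your build's \`--help\` output for the exact form of that flag.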
Journey Context:
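Most users assume an FP16 or FP32 KV cache is mandatory. At 128k context the FP16 cache alone is far larger than a 24GB card: the exact size depends on the architecture, but a 70B model with grouped-query attention \(80 layers, 8 KV heads, head dimension 128, as in Llama 3 70B\) needs about 40 GiB for the cache, and models without GQA need several times more. The common mistake is lowering the context size instead of quantizing the cache. Relative to FP16, a Q4\_0 cache \(4.5 bits per element\) cuts memory by about 72%, and a Q8\_0 cache \(8.5 bits per element\) cuts it by about 47%, or roughly 1.9x; a figure close to 4x for Q8\_0 only holds against an FP32 baseline. Reported perplexity loss is <2% for Q4\_0 on most models and <0.5% for Q8\_0, which makes Q8\_0 the sweet spot for high-quality RAG. Note that the weights of a 70B Q4 model \(roughly 40GB\) do not fit in 24GB on their own, so running 128k context entirely on one 24GB card is realistic only for smaller models or with partial CPU offload. This is distinct from model quantization: it is runtime compression of the cache, not of the weights.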
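The cache sizes involved can be estimated with a short calculation. The sketch below is illustrative only: the helper name \`kv\_cache\_bytes\` and the model shape \(80 layers, 8 KV heads, head dimension 128\) are assumptions chosen to resemble a Llama-3-70B-style layout, and the bits-per-element values assume the ggml block formats \(32 values plus one 16-bit scale per block for \`q8\_0\` and \`q4\_0\`\). It ignores compute buffers and any padding the runtime adds.

```python
# Approximate bits stored per cache element for each cache type.
# q8_0 / q4_0: 32 quantized values plus one 16-bit scale per block,
# which works out to 8.5 and 4.5 bits per element respectively.
BITS_PER_ELEMENT = {"f32": 32.0, "f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   cache_type_k="f16", cache_type_v="f16"):
    """Estimate KV cache size in bytes (hypothetical helper, not a llama.cpp API)."""
    # Number of elements in the K cache; the V cache has the same count.
    elements = n_layers * n_kv_heads * head_dim * n_ctx
    total_bits = elements * (BITS_PER_ELEMENT[cache_type_k]
                             + BITS_PER_ELEMENT[cache_type_v])
    return total_bits / 8


# Assumed model shape (Llama-3-70B-style GQA) at 128k tokens of context.
shape = dict(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=128 * 1024)

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**shape, cache_type_k=cache_type, cache_type_v=cache_type)
    print(f"{cache_type:>5}: {size / GIB:.2f} GiB")
```

Under these assumptions the estimate is 40 GiB for FP16, 21.25 GiB for Q8\_0, and 11.25 GiB for Q4\_0. Swapping in the layer count, KV head count, and head dimension of the model actually being run gives a quick check of whether a given context length is likely to fit before launching.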
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:35:14.596958+00:00— report_created — created