Report #79714
[tooling] llama.cpp OOM with large context despite using Q4_0 quantized weights
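Add `-ctk q4_0 -ctv q4_0` to quantize the KV cache keys and values to Q4_0 (or use `q8_0` for higher quality). Quantizing the V cache requires Flash Attention to be enabled (`-fa`; the exact flag syntax depends on the build, so check `--help`). Q4_0 stores 4.5 bits per value versus 16 for FP16, which cuts cache VRAM by about 72% (about 47% for Q8_0), reportedly with <2% perplexity degradation. That allows roughly 2-4x longer context windows on the same hardware.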
Journey Context:
Users obsess over model quantization (Q4 vs Q5) but ignore the KV cache, which scales linearly with sequence length and dominates VRAM at long contexts. At 128k context, an FP16 cache can exceed 30GB, depending on the model's layer count and number of KV heads. Flash Attention shrinks the attention scratch buffers but not the cache itself, and context shifting discards earlier state. KV cache quantization (Q4_0/Q8_0) is the sweet spot: it reuses the block formats already used for weights, adds only modest dequantization overhead, and reported benchmarks show minimal quality loss (often imperceptible in practice, though Q4_0 degrades more than Q8_0). The common error is assuming the cache must remain FP16.
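
To make the scaling concrete, here is a rough sketch of the cache arithmetic. It assumes the usual layout of one K and one V vector per layer, per KV head, per token, and uses the ggml block sizes for Q4_0 and Q8_0 (32 values plus one FP16 scale per block). The model dimensions are hypothetical, chosen only for illustration, and the estimate ignores compute buffers and any padding the runtime adds.

```python
# Bytes per block of 32 values for each cache type (ggml block layouts).
BLOCK_SIZE = 32
BYTES_PER_BLOCK = {
    "f16": 64,   # 32 values x 2 bytes
    "q8_0": 34,  # 32 int8 values + one fp16 scale
    "q4_0": 18,  # 32 4-bit values + one fp16 scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate the K+V cache size in bytes for a single sequence."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    bytes_per_value = BYTES_PER_BLOCK[cache_type] / BLOCK_SIZE
    return int(values_per_token * n_ctx * bytes_per_value)


def savings_vs_f16(cache_type):
    """Fraction of cache memory saved relative to an FP16 cache."""
    return 1 - BYTES_PER_BLOCK[cache_type] / BYTES_PER_BLOCK["f16"]


if __name__ == "__main__":
    # Hypothetical GQA model: 80 layers, 8 KV heads, head dimension 128.
    cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=128 * 1024)
    for cache_type in BYTES_PER_BLOCK:
        gib = kv_cache_bytes(cache_type=cache_type, **cfg) / 2**30
        saved = savings_vs_f16(cache_type)
        print(f"{cache_type:>5}: {gib:6.2f} GiB ({saved:.0%} smaller than f16)")
```

For this hypothetical configuration the FP16 cache comes to 40 GiB at 128k tokens, and the Q4_0 cache to 11.25 GiB. Substitute the layer count, KV head count, and head dimension from your own model's metadata to get a figure that applies to your setup.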
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:23:49.806835+00:00— report_created — created