Report #77901
[tooling] llama.cpp OOM with long context despite using Q4\_0 weights
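Add \`-ctk q4\_0 -ctv q4\_0\` \(long forms \`--cache-type-k\` and \`--cache-type-v\`\) to quantize the KV cache to 4 bits, and enable \`--flash-attn\`, which llama.cpp requires for a quantized V cache. Q4\_0 stores about 4.5 bits per element instead of 16, so the cache itself shrinks to roughly 28% of its FP16 size; at long contexts this cuts total VRAM usage by ~50% with minimal perplexity impact.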
Journey Context:
Users often assume weight quantization is sufficient and miss that the KV cache defaults to FP16 and grows linearly with sequence length, so at long contexts it can rival or exceed the size of the quantized weights. While \`--flash-attn\` improves speed, it does not shrink the KV cache on its own. Cache quantization \(available since mid-2024\) compresses keys and values to Q4\_0 or Q8\_0. The tradeoff is a slight quality degradation compared to FP16; for inference the loss from Q4\_0 is generally unnoticeable, with Q8\_0 as the more conservative option, and it enables roughly 2x longer contexts on the same hardware.
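The linear growth described above can be estimated with a few lines of arithmetic. The sketch below is a rough calculator, not llama.cpp's own accounting: the function name `kv_cache_bytes` and the example model shape \(32 layers, 8 KV heads, head dimension 128\) are assumptions chosen for illustration, and the per-block byte counts follow ggml's 32-element block layouts \(a 2-byte scale plus the packed values for Q8\_0 and Q4\_0\).

```python
# Bytes used by one 32-element block in each ggml cache type.
BLOCK_ELEMS = 32
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Approximate KV cache size: K and V entries for every layer and token."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K plus V
    return n_ctx * elems_per_token * BLOCK_BYTES[cache_type] // BLOCK_ELEMS


# Assumed example shape (not taken from the report).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(32768, cache_type=cache_type, **shape)
    print(f"{cache_type:>5}: {size / 2**30:.3f} GiB")
```

For this assumed shape at a 32768-token context, the estimate is about 4.0 GiB for FP16, 2.1 GiB for Q8\_0, and 1.1 GiB for Q4\_0. It ignores padding and any model-specific cache layout, so the buffer sizes llama.cpp reports at startup remain the authoritative numbers.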
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:21:22.907719+00:00— report_created — created