Report #75429
[tooling] llama.cpp OOM with long context or slow inference on large models
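Add --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0) to llama-server or llama-cli (formerly main). This quantizes the KV cache, cutting its memory by roughly half (q8_0) to nearly three quarters (q4_0) relative to FP16, usually with minimal perplexity impact. Note that a quantized V cache requires flash attention (--flash-attn / -fa); some builds enable it automatically, others need the flag set explicitly.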
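As an illustration of the fix above, the following sketch assembles a llama-server command line with a quantized KV cache. The helper name `llama_server_args` and the model path are hypothetical. The handling of `--flash-attn` is an assumption that depends on your build: older builds treat it as a bare switch, newer ones accept a value such as `on` or `auto`, so check `llama-server --help` before running.

```python
import shlex


def llama_server_args(model_path, ctx_size, cache_type="q8_0", flash_attn_value=None):
    """Build an argv list for llama-server with a quantized KV cache.

    flash_attn_value: None emits a bare --flash-attn switch (older builds);
    pass "on" or "auto" for builds where the flag takes a value (assumption).
    """
    args = [
        "llama-server",
        "-m", model_path,          # path to the GGUF model
        "-c", str(ctx_size),       # context length in tokens
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
        "--flash-attn",            # needed for a quantized V cache
    ]
    if flash_attn_value is not None:
        args.append(flash_attn_value)
    return args


# Hypothetical model path, for illustration only.
print(shlex.join(llama_server_args("models/example.gguf", 16384)))
```

Use `cache_type="q4_0"` for the more aggressive setting, and compare output quality on your own prompts before adopting it.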
Journey Context:
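Most users quantize only the weights (GGUF) and leave the KV cache in FP16, which dominates memory at long contexts. Cache size is approximately 2 * layers * seq_len * n_kv_heads * head_dim * bytes_per_element, where the factor of 2 covers keys and values; for models without grouped-query attention, n_kv_heads * head_dim equals hidden_dim. An FP16 cache can rival or exceed quantized weight memory once contexts grow past a few thousand tokens, especially on models without grouped-query attention. Quantizing the cache to Q8_0 (8-bit) roughly halves its memory; Q4_0 (4-bit) cuts it to a little over a quarter, with acceptable quality loss for many retrieval tasks. Common mistake: assuming that weight quantization alone is enough. Tradeoff: a slight latency increase from dequantization during attention, usually outweighed by avoiding swapping to system memory. Alternatives: context compression (not standard) or smaller models (quality loss).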
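The cache-size arithmetic above can be checked with a short calculation. This is a sketch, not a measurement: the function name `kv_cache_bytes` and the example model shape (32 layers, 32 KV heads, head dimension 128, i.e. hidden size 4096 without grouped-query attention) are assumptions chosen for illustration. The per-element sizes follow from the q8_0 and q4_0 block layouts, which store 32 values plus one FP16 scale in 34 and 18 bytes respectively.

```python
# Bytes per cached element for each cache type.
# q8_0: 32 int8 values + fp16 scale = 34 bytes per block of 32.
# q4_0: 32 4-bit values + fp16 scale = 18 bytes per block of 32.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate KV cache size in bytes for one sequence."""
    # Factor of 2: one tensor for keys and one for values per layer.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return elements_per_token * seq_len * BYTES_PER_ELEMENT[cache_type]


# Assumed example shape (hypothetical model): 32 layers, 32 KV heads, head_dim 128.
for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(32, 8192, 32, 128, cache_type) / 2**30
    print(f"{cache_type}: {gib:.3f} GiB")
```

For this assumed shape at 8192 tokens, the estimate is 4 GiB in FP16, 2.125 GiB in q8_0, and 1.125 GiB in q4_0, which is where the "roughly half" and "about a quarter" figures come from. Models with grouped-query attention have fewer KV heads and a proportionally smaller cache.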
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:12:30.717217+00:00— report_created — created