Report #90655
[tooling] llama.cpp server OOM on long contexts despite using --ctx-size 128k with 24GB VRAM
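Add --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0) to quantize the KV cache. Relative to an F16 cache, q8_0 cuts KV cache memory by roughly 47% and q4_0 by roughly 72%. The saving applies to the cache only, not to the model weights, and perplexity degradation is typically minimal at q8_0.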
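As an illustration of the flags above, here is a minimal sketch that assembles a llama-server command line. The helper name `build_server_args`, the model path `model.gguf`, and the `--n-gpu-layers 99` value are placeholders chosen for this example, not values from the report. Some llama.cpp builds also require flash attention to be enabled before the V cache can be quantized. The form of that flag varies by version, so it is left out here and should be checked against your build's `--help` output.

```python
import shlex


def build_server_args(model_path, ctx_size=131072, cache_type="q8_0", gpu_layers=99):
    """Return an argv list for llama-server with a quantized KV cache.

    All default values are illustrative assumptions, not recommendations.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),         # 128k tokens written out as an integer
        "--n-gpu-layers", str(gpu_layers),   # placeholder: offload as many layers as fit
        "--cache-type-k", cache_type,        # quantize the K half of the cache
        "--cache-type-v", cache_type,        # quantize the V half of the cache
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command; pass the list to subprocess.run to launch it.
    print(shlex.join(build_server_args("model.gguf")))
```

Building the arguments as a list rather than a single string avoids shell-quoting problems when the model path contains spaces.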
Journey Context:
Many users assume the KV cache must stay in FP16/BF16. When they hit OOM with large context windows, they lower ctx-size or ngl instead of shrinking the cache itself. The tradeoff is that Q8_0 is reported to add ~0.1-0.3 perplexity vs F16, but it fits roughly 2x longer contexts in the same cache budget. Q4_0 saves more VRAM but can degrade coherence on complex reasoning. This is distinct from model weight quantization (e.g. Q4_K_M): it shrinks the keys and values that attention stores per token during inference, a footprint that grows linearly with context length.
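To make the memory tradeoff concrete, the sketch below estimates KV cache size for each cache type. It relies on the llama.cpp block layouts, where q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumed example and does not come from the report, so substitute the values for your own model. The estimate covers the cache only and ignores weights, compute buffers, and other overhead.

```python
BLOCK_SIZE = 32  # values per quantization block

# Bytes used to store one block of 32 values for each cache type.
BYTES_PER_BLOCK = {
    "f16": 64,   # 32 values * 2 bytes
    "q8_0": 34,  # 32 int8 values + one 2-byte scale
    "q4_0": 18,  # 32 4-bit values (16 bytes) + one 2-byte scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size in bytes for a given context length and cache type."""
    # Factor 2 covers the separate K and V tensors.
    elements = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return elements * BYTES_PER_BLOCK[cache_type] // BLOCK_SIZE


if __name__ == "__main__":
    # Assumed example shape; replace with the values for your model.
    shape = dict(n_ctx=131072, n_layers=32, n_kv_heads=8, head_dim=128)
    baseline = kv_cache_bytes(cache_type="f16", **shape)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(cache_type=cache_type, **shape)
        saving = 100 * (1 - size / baseline)
        print(f"{cache_type}: {size / 2**30:.1f} GiB ({saving:.0f}% smaller than f16)")
```

For this assumed shape, an F16 cache at 131072 tokens comes to 16 GiB, which explains the OOM on a 24 GB card once model weights are loaded. The same cache is 8.5 GiB at q8_0 and 4.5 GiB at q4_0.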
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:45:25.363066+00:00— report_created — created