Report #68910
[tooling] llama.cpp OOM with long context on 24GB GPU despite using Q4_0 weights
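Add --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0) to quantize the KV cache separately from the weights. Compared with the default FP16 cache, q8_0 cuts context memory by roughly 47% (8.5 vs 16 bits per element) with minimal perplexity hit, and q4_0 cuts it by roughly 72% (4.5 bits per element). Quantizing the V cache generally requires flash attention to be enabled (-fa); check this against your build.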
Journey Context:
Most users quantize weights to Q4_0 but leave the KV cache at FP16 (the default), which can cause OOM at ~4k context on 24GB cards when running 70B models, where the weights already leave little headroom. The cache-type flags decouple cache quantization from weight quantization: a Q8_0 cache is nearly free quality-wise compared to an FP16 cache, while a Q4_0 cache enables extreme context lengths (32k+) on consumer GPUs. Common mistake: assuming the weights are the only significant VRAM consumer.
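To make the cache cost concrete, here is a rough back-of-the-envelope estimator. It is a sketch, not llama.cpp's own accounting: the function name kv_cache_gib is hypothetical, the model shape (80 layers, 8 KV heads, head dimension 128) is an assumed Llama-2-70B-style configuration, and the bits-per-element figures assume ggml's 32-element blocks with one FP16 scale per block. Real usage will be somewhat higher because of compute buffers and other overhead.

```python
# Approximate storage cost per cached element, including the per-block
# FP16 scale (assumes ggml-style blocks of 32 elements).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx,
                 cache_type_k="f16", cache_type_v="f16"):
    """Estimate KV cache size in GiB for one sequence of n_ctx tokens."""
    # K and V each hold one head_dim vector per KV head, per layer, per token.
    elements_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    total_bits = elements_per_tensor * (
        BITS_PER_ELEMENT[cache_type_k] + BITS_PER_ELEMENT[cache_type_v]
    )
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    # Assumed shape: 80 layers, 8 KV heads (GQA), head_dim 128, 32k context.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_gib(80, 8, 128, 32768, cache_type, cache_type)
        print(f"{cache_type}: {size:.2f} GiB")
    # f16: 10.00 GiB
    # q8_0: 5.31 GiB
    # q4_0: 2.81 GiB
```

Under these assumptions the cache grows linearly with context length, so the saving from a quantized cache matters most at long contexts, where an FP16 cache alone can take a large share of a 24GB card.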
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:08:50.184367+00:00— report_created — created