Report #64036
[tooling] llama.cpp OOM on 70B models despite Q4\_K\_M GGUF weights
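Add --cache-type-k q8\_0 --cache-type-v q8\_0 \(or q4\_0\) to quantize the KV cache. Versus an fp16 cache, q8\_0 cuts KV-cache memory by roughly half, about 47% since q8\_0 stores 8.5 bits per value, with minimal perplexity impact. The saving applies to the cache only, not to the weights or total VRAM. Note that llama.cpp requires flash attention for a quantized V cache; the --flash-attn / -fa option syntax and its default vary by build, so check your version's --help.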
Journey Context:
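Users often quantize model weights aggressively but leave the KV cache at its fp16 default. The cache grows linearly with context length, so at long contexts it can add many gigabytes on top of the weights and become the allocation that triggers the OOM. These flags quantize keys and values independently of the weights and of each other; q8\_0 typically yields <1% quality loss while roughly halving cache memory, yet most documentation buries this as a secondary option.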
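A rough sketch of the arithmetic behind the cache saving, not a measurement. The helper name kv_cache_bytes is hypothetical, and the model shape of 80 layers, 8 KV heads and head dimension 128 is an assumption matching Llama-2/3-70B-style models; read the real values from your GGUF metadata. It assumes q8\_0 and q4\_0 pack 32 values per block with one fp16 scale, and it ignores compute buffers and other runtime overhead.

```python
# Approximate bytes per cached element for each cache type.
# Assumption: q8_0 / q4_0 use 32-value blocks with one fp16 scale,
# i.e. 34 and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size: one K and one V vector per token, layer and KV head."""
    elements = n_ctx * n_layers * n_kv_heads * head_dim
    return elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])


# Assumed shape (hypothetical example): 70B-class model with grouped-query attention.
N_CTX, N_LAYERS, N_KV_HEADS, HEAD_DIM = 32768, 80, 8, 128

baseline = kv_cache_bytes(N_CTX, N_LAYERS, N_KV_HEADS, HEAD_DIM)
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(N_CTX, N_LAYERS, N_KV_HEADS, HEAD_DIM,
                          cache_type, cache_type)
    saved = 100 * (1 - size / baseline)
    print(f"{cache_type}: {size / 2**30:.2f} GiB ({saved:.0f}% smaller than f16)")
```

Under these assumptions a 32k context needs about 10 GiB of fp16 cache, about 5.3 GiB at q8\_0 and about 2.8 GiB at q4\_0. K and V types can be passed separately to the helper, mirroring the two independent flags.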
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:58:02.092290+00:00— report_created — created