Report #1228
[tooling] llama.cpp OOMs or cannot fit a 70B-class model with a useful context window on a 24-48 GB GPU
Quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0 in extreme cases) and add --flash-attn, which llama.cpp requires for a quantized V cache. This shrinks KV-cache memory by roughly 2x (q8_0) to 4x (q4_0) on top of whatever weight quantization already saves, and is often the difference between a 2k and an 8k+ context window with minimal quality loss.
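A minimal sketch of how the flags above might be assembled into a launch command. The binary name (llama-server), the model path, and the -m / -c / -ngl values are assumptions for illustration and are not part of this report. Some builds may expect an explicit value after --flash-attn, so check the --help output of your build before running.

```python
import shlex


def build_llama_args(model_path, ctx_size=8192, cache_type="q8_0", gpu_layers=99):
    """Return an argv list that enables a quantized KV cache.

    The binary name, model path, context size, and GPU layer count are
    illustrative assumptions; only the cache and flash-attn flags come
    from the report itself.
    """
    return [
        "llama-server",                 # assumed binary name
        "-m", model_path,               # hypothetical GGUF path
        "-c", str(ctx_size),            # requested context window
        "-ngl", str(gpu_layers),        # layers to offload to the GPU
        "--flash-attn",                 # needed for a quantized V cache
        "--cache-type-k", cache_type,   # K cache element type
        "--cache-type-v", cache_type,   # V cache element type
    ]


args = build_llama_args("models/example-70b.Q4_K_M.gguf")
print(shlex.join(args))  # copy-pasteable command line
```

Passing cache_type="q4_0" gives the more aggressive variant. Nothing here launches the process; the list can be handed to subprocess.run once the command has been checked.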
Journey Context:
Most users quantize only the weights (GGUF) and assume the context will fit, but the KV cache grows linearly with sequence length and, at long contexts, can exceed the memory taken by the weights. The cache defaults to F16; q8_0 roughly halves it and q4_0 roughly quarters it. Community perplexity tests show only a small hit for many models, though some GQA-heavy models are more sensitive. Pairing cache quantization with Flash Attention keeps attention memory-efficient. The common mistake is dropping to a lower weight bpw (and lower quality) instead of quantizing the cache.
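The linear growth can be estimated with a short calculation. This is a rough sketch, not llama.cpp's own accounting: the model shape (80 layers, 8 KV heads, head dimension 128) is an assumed Llama-style 70B configuration with GQA, K and V are assumed to use the same cache type, and compute buffers are ignored. The per-block sizes follow the ggml block layouts (q8_0: 34 bytes per 32 elements, q4_0: 18 bytes per 32 elements).

```python
# (bytes per block, elements per block) for each cache type
CACHE_TYPES = {
    "f16": (2, 1),
    "q8_0": (34, 32),   # 32 int8 values + one fp16 scale
    "q4_0": (18, 32),   # 32 4-bit values + one fp16 scale
}


def kv_cache_bytes(n_ctx, cache_type="f16", n_layers=80, n_kv_heads=8, head_dim=128):
    """Estimate KV-cache size in bytes for a given context length.

    Shape defaults are assumptions for a Llama-style 70B model with GQA.
    """
    block_bytes, block_elems = CACHE_TYPES[cache_type]
    # K and V each store n_kv_heads * head_dim values per layer per token.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * block_bytes // block_elems


for cache_type in CACHE_TYPES:
    gib = kv_cache_bytes(8192, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB at 8192 tokens")
```

Under these assumptions the savings are about 1.9x for q8_0 and 3.6x for q4_0 rather than exactly 2x and 4x, because each block also stores a scale. Models without GQA have many more KV heads, so their cache is correspondingly larger for the same context.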
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:53:24.986663+00:00— report_created — created