Report #9343
[tooling] llama.cpp crashes with OOM or runs extremely slowly when extending context beyond 8k tokens despite using Q4_K_M weights
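Add -ctk q8_0 -ctv q8_0 (or q4_0 for extreme cases) to quantize the KV cache. Relative to an FP16 cache this cuts cache memory by roughly 47% (q8_0) to 72% (q4_0), typically with negligible perplexity loss at q8_0. Depending on the llama.cpp version, a quantized V cache (-ctv) may also require flash attention to be enabled (-fa / --flash-attn).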
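As a minimal sketch of how the flags fit into a launch command, the snippet below assembles an argument list. The binary name (llama-server), the model path, and the context size are placeholders chosen for illustration, not values from this report; only -ctk and -ctv come from the workaround itself.

```python
import shlex


def kv_cache_flags(k_type="q8_0", v_type="q8_0"):
    """Return the llama.cpp flags that set the K and V cache types."""
    return ["-ctk", k_type, "-ctv", v_type]


# Hypothetical invocation: binary, model path, and context size are placeholders.
# Some builds also need flash attention enabled for a quantized V cache.
cmd = ["llama-server", "-m", "model.gguf", "-c", "32768"] + kv_cache_flags()

# Print the command instead of running it, so it can be checked first.
print(shlex.join(cmd))
```

Passing "q4_0" for either argument gives the more aggressive variant; the list can be handed to subprocess.run once it has been checked against the installed build's --help output.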
Journey Context:
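Most users quantize only the weights (the GGUF type) and overlook that KV cache memory grows linearly with context length. The exact size depends on the model's attention layout: for a 70B model with grouped-query attention (for example Llama 2 70B, with 80 layers, 8 KV heads, and head dimension 128), an FP16 KV cache takes about 2.5 GiB at 8k context and about 10 GiB at 32k, and a model without grouped-query attention needs several times more. Quantizing the cache to Q8_0 roughly halves this, usually with very little quality degradation (the cached activations appear to tolerate 8-bit rounding well). Q4_0 shrinks the cache to under a third of FP16 and is viable for very long contexts, at a higher risk of quality loss. Cache quantization is distinct from weight quantization and is controlled separately via -ctk/-ctv. Without it, a 70B model at 32k context is much harder to fit on 24GB VRAM cards, where the Q4_K_M weights alone already exceed VRAM and must be partly offloaded to CPU.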
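The cache sizes can be estimated with a short calculation. This is a rough sketch, not llama.cpp's own accounting: it assumes a Llama-2-70B-style layout (80 layers, 8 KV heads, head dimension 128) and the ggml block layouts for f16, q8_0, and q4_0 (32 elements per block, with a 2-byte scale on the quantized types). The function name kv_cache_bytes is made up for this example, and real usage adds compute buffers on top.

```python
# Bytes per 32-element block: f16 is 2 bytes/element; q8_0 and q4_0 store
# 32 and 16 bytes of quantized values plus a 2-byte fp16 scale per block.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   k_type="f16", v_type="f16"):
    """Estimate KV cache size in bytes for the given model shape and cache types."""
    # Number of elements in the K cache; the V cache has the same count.
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    k_bytes = elems * BLOCK_BYTES[k_type] // BLOCK_SIZE
    v_bytes = elems * BLOCK_BYTES[v_type] // BLOCK_SIZE
    return k_bytes + v_bytes


# Assumed shape: 80 layers, 8 KV heads, head dimension 128 (Llama-2-70B-style).
for n_ctx in (8192, 32768):
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(80, 8, 128, n_ctx, cache_type, cache_type)
        print(f"ctx={n_ctx:>6} {cache_type:>5}: {size / GIB:.2f} GiB")
```

Under these assumptions the 32k case drops from 10 GiB at f16 to about 5.3 GiB at q8_0 and about 2.8 GiB at q4_0. Models with a different layer count or number of KV heads scale proportionally.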
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:51:55.769417+00:00— report_created — created