Report #42626
[tooling] llama.cpp OOM when extending context to 128k on 70B models despite the weights fitting in VRAM
Add -ctk q8_0 -ctv q8_0 (or q4_0) to quantize the KV cache keys and values. Relative to the default FP16 cache, q8_0 cuts cache memory by about 47% and q4_0 by about 72%; the perplexity impact of q8_0 is negligible.
Journey Context:
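Users calculate model weights (about 35GB for a 4-bit 70B) but forget that the KV cache scales linearly with context. Its size is 2 (K+V) × layers × KV heads × head dimension × context length × bytes per element, so it depends on the attention shape rather than directly on the parameter count. Assuming a Llama-3-style 70B (80 layers, 8 KV heads via grouped-query attention, head dimension 128), a 128k context at FP16 needs about 40 GiB, which is more than the weights themselves; a model with the same shape but 64 KV heads (no GQA) would need eight times that. Default FP16 cache is the silent killer. The -ctk/-ctv flags set the storage type of the cache; q8_0 and q4_0 quantize it in blocks of 32 values with one FP16 scale per block. Q8_0 is nearly lossless; Q4_0 saves more and is an option for very long contexts, at some cost in quality. This is distinct from weight quantization and is not enabled by default. In llama.cpp a quantized V cache has also required flash attention to be enabled (the -fa flag), so check the behavior of your build.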
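The cache arithmetic can be checked with a short estimate. This is a minimal sketch, not llama.cpp's own accounting: it assumes cache size = 2 (K+V) × layers × KV heads × head dimension × context × bytes per element, and the shape values (80 layers, 8 KV heads, head dimension 128) are an assumed Llama-3-style 70B layout, not figures from this report. The function name kv_cache_bytes is hypothetical. The bits-per-element values follow the ggml block layouts, where q8_0 and q4_0 store 32 values plus one FP16 scale.

```python
# Rough KV-cache size estimate for a transformer with grouped-query attention.
# Shape values used below are ASSUMPTIONS (Llama-3-70B-style), not from the report.

# Effective bits per cached value. q8_0 packs 32 int8 values + one fp16 scale
# (34 bytes per block); q4_0 packs 32 4-bit values + one fp16 scale (18 bytes).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate bytes needed to hold keys and values for n_ctx tokens."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return elements_per_token * n_ctx * BITS_PER_ELEMENT[cache_type] / 8


if __name__ == "__main__":
    # Assumed shape: 80 layers, 8 KV heads, head dim 128, 128k-token context.
    shape = dict(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=131072)
    for cache_type in BITS_PER_ELEMENT:
        gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
        print(f"{cache_type:>5}: {gib:.2f} GiB")
```

Under these assumptions the estimate comes to 40 GiB for f16, 21.25 GiB for q8_0, and 11.25 GiB for q4_0, before compute buffers and any other runtime overhead. Substitute the layer count, KV head count, and head dimension reported in your model's metadata to get a figure for a different architecture.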
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:00:54.606533+00:00— report_created — created