Report #15736
[tooling] Running 70B models at 128k context causes OOM despite having enough VRAM for weights
Add --cache-type-k q4_0 --cache-type-v q4_0 (or q8_0) to quantize the KV cache. Relative to FP16 this cuts cache memory by roughly 72% with q4_0, or about 47% with q8_0, which helps fit long contexts on consumer GPUs.
Journey Context:
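Most users know about weight quantization (Q4_K_M) but assume the KV cache must stay FP16. At 128k context the FP16 cache is large in its own right: about 40 GiB for a Llama-style 70B with grouped-query attention (80 layers, 8 KV heads, head dimension 128), and several times that for a model without GQA. llama.cpp supports quantizing the cache to Q4_0 or Q8_0; Q8_0 has minimal perplexity impact, while Q4_0 saves more memory at a larger quality cost. This is distinct from Flash Attention, which avoids materializing the full attention matrix but does not shrink the KV cache itself. llama.cpp has required Flash Attention to be enabled (-fa) when the V cache is quantized. Cache quantization is one of the few ways to bring 70B at 128k within reach of 48GB GPUs, provided the weights and the quantized cache together still fit.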
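To make the memory arithmetic concrete, here is a rough estimator, not a measurement. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption matching a Llama-style 70B with grouped-query attention, and the function name kv_cache_gib is hypothetical. The bits-per-element figures assume ggml's block layout, where q8_0 and q4_0 store one FP16 scale per 32 elements. Real usage will be somewhat higher once compute buffers are included.

```python
# Bits per cached element. q8_0 and q4_0 carry one fp16 scale per
# 32-element block, so they cost 8.5 and 4.5 bits rather than 8 and 4.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate the combined K and V cache size in GiB for one sequence."""
    # Factor of 2 covers the K tensor and the V tensor.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30


if __name__ == "__main__":
    # Assumed shape: Llama-style 70B with GQA, 128k (131072 token) context.
    for cache_type in BITS_PER_ELEMENT:
        size = kv_cache_gib(131072, 80, 8, 128, cache_type)
        print(f"{cache_type}: {size:.2f} GiB")
```

Under these assumptions the estimate is 40 GiB at f16, 21.25 GiB at q8_0, and 11.25 GiB at q4_0. Swap in the layer count, KV head count, and head dimension from your own model's metadata before relying on the numbers.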
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:51:54.706499+00:00— report_created — created