Report #84944
[tooling] OOM or inability to run large context windows (32k+) on 24GB VRAM cards with 70B models
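Quantize the KV cache with `--cache-type-k q4_0 --cache-type-v q4_0` (or `q8_0` for better quality) in the llama.cpp server/CLI. Relative to fp16, q4_0 shrinks the KV cache by about 3.5x rather than a full 4x (4.5 bits per value instead of 16, because each block of 32 values also stores an fp16 scale), and q8_0 shrinks it by about 1.9x. The perplexity impact is usually small. Note that llama.cpp builds have required flash attention (`-fa` / `--flash-attn`) to be enabled before the V cache can be quantized; check `--help` for your version.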
Journey Context:
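Users estimate VRAM needs as model_weights + KV_cache + overhead. For a 70B model at Q4, the weights alone are ~40GB, and an fp16 KV cache for a 32k context adds many more gigabytes on top, so people conclude they need A100s. llama.cpp, however, can quantize the KV cache at runtime (block-wise, in the same q8_0/q4_0 formats used for weights). Tradeoff: slightly higher perplexity (this report cites <0.1% relative degradation for q4_0, a figure that is unverified and varies by model), in exchange for roughly 3.5x more context in the same cache budget or more VRAM left over for the model itself. Common confusion: treating this as a model conversion flag. It is a runtime inference flag, and the GGUF file is not modified.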
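The KV cache term in that estimate can be worked out by hand. The sketch below is a rough calculator, not llama.cpp's own accounting: the function name `kv_cache_bytes` and the `SHAPE` dictionary are made up for illustration, and the model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is an assumption matching Llama-2/3-70B-style models. It ignores compute buffers and any other overhead.

```python
# Bytes per cached value for each llama.cpp cache type.
# q8_0 and q4_0 store blocks of 32 values plus one fp16 (2-byte) scale per block.
BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 one-byte values + 2-byte scale
    "q4_0": 18 / 32,  # 32 four-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Approximate combined size of the K and V caches, in bytes."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # K plus V
    return n_ctx * values_per_token * BYTES_PER_VALUE[cache_type]


# Assumed shape of a Llama-2/3-70B-style model (not read from any GGUF file).
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for cache_type in BYTES_PER_VALUE:
        gib = kv_cache_bytes(32768, cache_type=cache_type, **SHAPE) / 2**30
        print(f"{cache_type:>5}: {gib:.2f} GiB")
```

Under these assumptions the 32k cache comes to 10 GiB at f16, about 5.3 GiB at q8_0, and about 2.8 GiB at q4_0. Models without grouped-query attention have many more KV heads and proportionally larger caches, so substitute the values for your own model.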
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:09:53.405764+00:00— report_created — created