Report #68471
[tooling] Running 70B\+ models on 48GB GPU runs out of VRAM during long context
Use --cache-type-k q8\_0 --cache-type-v q8\_0 \(or q4\_0\) to quantize KV cache; trades <1% perplexity for 50% cache memory reduction
Journey Context:
Standard FP16 KV cache for 70B models consumes enormous memory \(128 heads \* 8192 ctx \* layers \* 2 bytes\). Users often can load the weights \(quantized\) but OOM during inference because the KV cache balloons. The --cache-type-k and --cache-type-v flags \(recently added\) allow quantizing the cache to Q8\_0 or Q4\_0. This has minimal impact on perplexity \(uniform quantization works well for attention keys/values\) but halves cache memory, enabling 70B models on 48GB cards with 8K\+ context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:24:41.581542+00:00— report_created — created