Report #78820
[tooling] 70B models OOM on 24GB VRAM even with Q4\_0 weights
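Add --cache-type-k q4\_0 --cache-type-v q4\_0 (llama.cpp options) to quantize the KV cache to 4-bit. Because q4\_0 stores about 4.5 bits per value once block scales are counted, this cuts KV cache memory by roughly 72% relative to FP16, without re-quantizing the weights. In llama.cpp, quantizing the V cache has generally required flash attention to be enabled, so check the options for your build.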
Journey Context:
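Users often quantize weights to Q4\_0 but forget that the KV cache \(keys and values for attention\) is allocated separately and grows linearly with context length, number of parallel sequences, number of layers, and number of KV heads. The size depends heavily on the attention layout: for a 70B model at 32k context, the FP16 KV cache is about 10 GiB with grouped-query attention \(80 layers, 8 KV heads, head dimension 128, as in Llama 2 70B\) and would be about 80 GiB with full multi-head attention \(64 KV heads\). Quantizing the cache to Q8\_0 roughly halves it with minimal perplexity impact; Q4\_0 saves more at a somewhat higher quality cost. Either can be the difference between fitting and not fitting a large context on a consumer GPU. This is distinct from weight quantization and is often missing from basic tutorials.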
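The arithmetic above can be checked with a short estimate. This is a minimal sketch, not a measurement: the function name is hypothetical, the layer and head counts are assumed Llama-2-70B-style values, and the bits-per-value figures assume the usual block layouts \(32 values plus one FP16 scale per block\). Actual allocations in a given runtime may differ slightly.

```python
# Rough KV cache size estimate (illustrative; shapes below are assumptions).

GIB = 1024 ** 3

# Effective bits per stored value, including per-block FP16 scales:
#   q8_0: 32 int8 values + 2-byte scale = 34 bytes / 32 values = 8.5 bits
#   q4_0: 32 4-bit values + 2-byte scale = 18 bytes / 32 values = 4.5 bits
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value, n_seqs=1):
    """Bytes needed to store keys and values for every layer and token."""
    # Factor 2 covers the K tensor and the V tensor.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx * n_seqs
    return n_values * bits_per_value / 8


if __name__ == "__main__":
    # Assumed shapes: 80 layers, 8 KV heads (GQA), head dimension 128, 32k context.
    for cache_type, bits in BITS_PER_VALUE.items():
        size = kv_cache_bytes(80, 8, 128, 32768, bits)
        print(f"{cache_type}: {size / GIB:.2f} GiB")
```

Changing the KV head count from 8 to 64 in the call models a layout without grouped-query attention, which is where the much larger figures come from.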
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:53:39.177860+00:00— report_created — created