Report #49439
[tooling] 70B model OOM on 24GB VRAM even with 4-bit weights
Enable KV cache quantization. In llama.cpp, pass `--cache-type-k q4_0 --cache-type-v q4_0`; in ExLlamaV2, set `cache_4bit=True`. This cuts KV cache VRAM by roughly 72% relative to FP16 (Q4_0 stores about 4.5 bits per value instead of 16), with a reported perplexity impact under 1%, which helps 70B models fit on 24GB cards.
Journey Context:
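Standard 4-bit weight quantization only reduces model parameter memory; the KV cache (which stores attention keys and values for the context) stays in full FP16 at 2 bytes per element. Per token, that is 2 (keys and values) × layers × KV heads × head dimension × 2 bytes. For a 70B-class model at 8k context, this ranges from about 2.5 GiB with grouped-query attention (Llama 2 70B: 80 layers, 8 KV heads, head dimension 128) to well over 10GB without it. Quantizing the KV cache to Q8_0 (8-bit) or Q4_0 (4-bit) reduces this by roughly 47% or 72% respectively, slightly short of the nominal 50% and 75% because each block of 32 values also stores a 16-bit scale. The tradeoff is a small perplexity degradation (reported as <1%), since attention tends to tolerate low-precision keys and values. In ExLlamaV2, this is a single boolean flag; in llama.cpp, it is set with explicit cache type flags for K and V. Per this report, a 70B model on a 24GB consumer GPU runs out of memory without it; with it, 70B at 4k+ context becomes workable, provided the quantized weights themselves leave room for the cache.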
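The cache arithmetic can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the layer and head counts are assumed to be those of Llama 2 70B, the helper names (`AttentionShape`, `kv_cache_bytes`) are made up for illustration, and real runtimes add allocator and compute-buffer overhead on top.

```python
from dataclasses import dataclass

# Bits stored per cached element, including per-block scales.
# ggml's q8_0 and q4_0 pack 32 values plus one 16-bit scale per block,
# which gives 8.5 and 4.5 bits per value respectively.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


@dataclass(frozen=True)
class AttentionShape:
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape, n_tokens, cache_type="f16"):
    """Estimate bytes needed to cache keys and values for n_tokens positions."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    elements_per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
    return elements_per_token * n_tokens * BITS_PER_ELEMENT[cache_type] / 8


# Assumed shape: Llama 2 70B (80 layers, 8 KV heads via GQA, head dim 128).
LLAMA2_70B = AttentionShape(n_layers=80, n_kv_heads=8, head_dim=128)

for cache_type in BITS_PER_ELEMENT:
    gib = kv_cache_bytes(LLAMA2_70B, 8192, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB at 8192 tokens")
```

Swapping in a different `AttentionShape` (for example, 64 KV heads for a model without grouped-query attention) shows how quickly the FP16 cache grows and how much of it the quantized cache types recover.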
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:28:12.076150+00:00— report_created — created