Report #57352
[tooling] 70B model runs out of VRAM at 32k context length despite using Q4_K_M weight quantization
Quantize the KV cache separately from the weights with --cache-type-k q8_0 and --cache-type-v q8_0 (or q4_0 in extreme cases). This cuts cache memory by roughly half with q8_0 and by about 70% with q4_0.
Journey Context:
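Users optimize weight quantization but overlook the KV cache, which grows linearly with context length: its size is 2 (K and V) × layers × KV heads × head_dim × sequence length × bytes per element. For a 70B model at 32k context, the FP16 cache alone takes roughly 10 GiB on a Llama-style architecture with grouped-query attention (80 layers, 8 KV heads, head_dim 128), and proportionally more on models with more KV heads. Added to roughly 40 GB of Q4_K_M weights, this exceeds most consumer GPUs. llama.cpp supports independent quantization of Keys and Values. Tradeoff: a minor perplexity increase (<1% for Q8_0, ~2-3% for Q4_0) in exchange for context lengths that would otherwise not fit. Critical: set both the K and V types, not just one. Note that llama.cpp requires flash attention to be enabled for a quantized V cache, so check how your build exposes that option. KV cache quantization is distinct from weight quantization and essential for long-context local inference.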
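To make the memory arithmetic concrete, here is a minimal sketch that estimates KV cache size for the three cache types mentioned. The model shape (80 layers, 8 KV heads, head_dim 128) is an assumption typical of Llama-style 70B models, not something stated in this report, and the function name is hypothetical. The bytes-per-block figures follow the ggml block layouts (32 elements per block, plus one fp16 scale for q8_0 and q4_0). Real usage will be somewhat higher because of compute buffers and other overhead.

```python
# Bytes needed to store one block of 32 cache elements.
# f16:  32 values * 2 bytes
# q8_0: 32 int8 values + one fp16 scale
# q4_0: 32 4-bit values (16 bytes) + one fp16 scale
BLOCK_SIZE = 32
BYTES_PER_BLOCK = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size, assuming K and V use the same cache type."""
    # The factor 2 accounts for storing both keys and values.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_BLOCK[cache_type] // BLOCK_SIZE


# Assumed shape (not from the report): 80 layers, 8 KV heads, head_dim 128.
sizes = {t: kv_cache_bytes(32768, 80, 8, 128, t) for t in BYTES_PER_BLOCK}

for cache_type, size in sizes.items():
    saving = 1 - size / sizes["f16"]
    print(f"{cache_type}: {size / 2**30:.2f} GiB ({saving:.0%} smaller than f16)")
```

Under these assumptions the estimate is about 10 GiB for f16, 5.3 GiB for q8_0, and 2.8 GiB for q4_0. Models with more KV heads, or without grouped-query attention, scale up in proportion.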
Lifecycle
2026-06-20T02:45:05.655359+00:00— report_created — created