Report #74470
[tooling] Running out of VRAM when extending context beyond 8k tokens despite using a 70B Q4 model
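Enable KV cache quantization by adding `--cache-type-k q8_0 --cache-type-v q8_0` (or `q4_0` for extreme cases) to your llama.cpp command. Relative to the default FP16 cache, this cuts KV cache memory by roughly 2x (`q8_0`) to 3.5x (`q4_0`), usually with little perplexity impact, which can make 32k+ context workable on 24GB cards. In llama.cpp, quantizing the V cache generally also requires flash attention to be enabled (`-fa` / `--flash-attn`); check `--help` for your build.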
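The size of the saving follows from how the ggml block formats store values. Below is a minimal sketch, assuming the usual layout of 32 values per block with one FP16 scale per block; the function names are illustrative and are not part of llama.cpp.

```python
# Sketch: storage cost of one 32-value block for each KV cache type.
# Assumes the common ggml layout: 32 values per block plus one FP16 scale
# for the quantized formats. Names here are illustrative only.

BLOCK_SIZE = 32   # values per quantization block
FP16_BYTES = 2    # bytes in one FP16 number


def block_bytes(cache_type: str) -> int:
    """Bytes needed to store one block of 32 cached values."""
    if cache_type == "f16":
        return BLOCK_SIZE * FP16_BYTES          # 32 plain FP16 values
    if cache_type == "q8_0":
        return BLOCK_SIZE + FP16_BYTES          # 32 int8 values + FP16 scale
    if cache_type == "q4_0":
        return BLOCK_SIZE // 2 + FP16_BYTES     # 32 packed 4-bit values + FP16 scale
    raise ValueError(f"unknown cache type: {cache_type}")


def savings_vs_f16(cache_type: str) -> float:
    """How many times smaller the cache is compared with FP16."""
    return block_bytes("f16") / block_bytes(cache_type)


if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        bits = block_bytes(cache_type) * 8 / BLOCK_SIZE
        print(f"{cache_type}: {bits:.1f} bits/value, "
              f"{savings_vs_f16(cache_type):.2f}x smaller than f16")
```

Because of the per-block scale, `q8_0` costs 8.5 bits per value rather than 8, and `q4_0` costs 4.5 rather than 4, so the savings land slightly under a clean 2x and 4x.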
Journey Context:
In transformer inference, the KV cache grows linearly with sequence length and layer count. For a 70B model, the FP16 KV cache consumes on the order of 300-400MB per 1k tokens (about 320 MiB for a Llama-2-70B-style layout with 80 layers, 8 KV heads, and head dimension 128), so at 32k context it takes 10GB or more of VRAM on top of the weights. Naive solutions include reducing the batch size or the context window. Quantizing the KV cache to 8-bit or 4-bit instead relies on cached keys and values tolerating reduced precision well. Early implementations feared that precision loss would accumulate across layers, but the block-wise quantization schemes used here (Q8_0, Q4_0, each storing one scale per block of 32 values) maintain coherence in practice. This is distinct from weight quantization and is often overlooked in VRAM calculations because users focus on model size rather than on the cache that grows with context.
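The linear growth described above can be checked with a short calculation. This is a rough sketch, assuming a Llama-2-70B-style configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128); the report does not name the exact model, so treat these parameters as an assumption and substitute your model's values. It counts only the cache itself, not weights or compute buffers.

```python
# Sketch: estimate KV cache size from model shape and context length.
# The model parameters below are an ASSUMPTION (Llama-2-70B-style layout);
# the report does not state which 70B model is in use.

GIB = 1024 ** 3

# Approximate bytes per cached value, including the per-block scale
# for the quantized formats (34/32 and 18/32 bytes respectively).
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: float = 2.0) -> float:
    """Total bytes for keys and values across all layers at n_ctx tokens."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    return values_per_token * n_ctx * bytes_per_value


ASSUMED_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for n_ctx in (8192, 32768):
        for cache_type, bpv in BYTES_PER_VALUE.items():
            size = kv_cache_bytes(**ASSUMED_70B, n_ctx=n_ctx, bytes_per_value=bpv)
            print(f"ctx={n_ctx:>6} {cache_type:>5}: {size / GIB:.2f} GiB")
```

Under these assumptions the FP16 cache at 32k context comes to 10 GiB, while `q8_0` and `q4_0` bring it to roughly 5.3 GiB and 2.8 GiB. Models without grouped-query attention have many more KV heads and correspondingly larger caches.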
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:35:48.738528+00:00— report_created — created