Report #11243
[tooling] Out-of-memory when extending context length beyond 8k/16k with large GGUF models \(70B\+\)
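Quantize the KV cache by adding \`--cache-type-k q8\_0 --cache-type-v q8\_0\` alongside \`--flash-attn\`. This reduces KV cache memory usage by ~47% \(fp16->q8\_0, since q8\_0 stores 8.5 bits per value instead of 16\), allowing roughly 1.9x longer contexts in the same cache budget with typically <0.5% perplexity degradation. Dropping to q4\_0 \(4.5 bits per value\) saves ~72%, about 3.5x the context, at a larger quality cost.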
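The savings figures follow from the ggml block layouts. A minimal sketch of that arithmetic, assuming the standard layouts in which q8\_0 and q4\_0 each pack 32 values together with one 2-byte fp16 scale; the function names are illustrative and not part of llama.cpp:

```python
# Assumed ggml block layouts: 32 values per block, plus one fp16 scale
# (2 bytes) for the quantized types.
BLOCK_SIZE = 32
BYTES_PER_BLOCK = {
    "f16": 2 * BLOCK_SIZE,        # 32 values * 2 bytes = 64 bytes
    "q8_0": 2 + BLOCK_SIZE,       # scale + 32 int8 values = 34 bytes
    "q4_0": 2 + BLOCK_SIZE // 2,  # scale + 32 packed 4-bit values = 18 bytes
}


def bytes_per_element(cache_type: str) -> float:
    """Average storage cost of one cached value for the given cache type."""
    return BYTES_PER_BLOCK[cache_type] / BLOCK_SIZE


def savings_vs_f16(cache_type: str) -> float:
    """Fraction of KV cache memory saved relative to an fp16 cache."""
    return 1.0 - bytes_per_element(cache_type) / bytes_per_element("f16")


def context_multiplier(cache_type: str) -> float:
    """How many times more tokens fit in the same cache budget vs fp16."""
    return bytes_per_element("f16") / bytes_per_element(cache_type)


if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        print(
            f"{cache_type:>5}: {bytes_per_element(cache_type):.4f} B/value, "
            f"saves {savings_vs_f16(cache_type):.1%}, "
            f"{context_multiplier(cache_type):.2f}x context"
        )
```

The multiplier applies only to the cache's share of VRAM; the weights and compute buffers are unaffected, so the end-to-end gain in usable context is smaller.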
Journey Context:
Users often assume model weights are the only memory bottleneck, but at long contexts the KV cache \(the cached attention keys and values\) can rival the weights in VRAM. A standard fp16 cache stores 2 bytes per element, so each token costs 2 \(K and V\) x n\_layers x n\_kv\_heads x head\_dim x 2 bytes. For a Llama-style 70B model \(80 layers, 8 KV heads via GQA, head dimension 128\), that is about 320 KiB per token: roughly 2.5 GiB at 8k context and 40 GiB at 128k. Quantizing the cache to q8\_0 \(or even q4\_0 for extreme cases\) is supported in llama.cpp's Flash Attention kernels since mid-2024. The tradeoff is minimal quality loss \(validated on perplexity benchmarks\) against fitting roughly twice the context in whatever VRAM remains after the weights are loaded. Without these flags, users often incorrectly blame the GGUF quantization level \(e.g., Q4\_K\_M\) for OOM errors at high context.
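To check whether a given context length fits, the cache size can be estimated before launching. A rough sketch, assuming a Llama-style 70B architecture \(80 layers, 8 KV heads, head dimension 128\); these values are assumptions here and should be read from the GGUF metadata of the actual model, and `kv_cache_bytes` is a hypothetical helper, not a llama.cpp API:

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """Estimate KV cache size: K and V tensors for every layer and token.

    bytes_per_elem is 2.0 for fp16; quantized cache types use less
    (for example 34/32 for q8_0, assuming 34-byte blocks of 32 values).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    # Assumed architecture for a Llama-style 70B model with GQA.
    arch = dict(n_layers=80, n_kv_heads=8, head_dim=128)
    for n_ctx in (8192, 32768, 131072):
        f16 = kv_cache_bytes(n_ctx=n_ctx, **arch) / GIB
        q8 = kv_cache_bytes(n_ctx=n_ctx, bytes_per_elem=34 / 32, **arch) / GIB
        print(f"n_ctx={n_ctx:>6}: fp16 {f16:6.2f} GiB, q8_0 {q8:6.2f} GiB")
```

This counts only the K and V tensors; llama.cpp also allocates compute buffers that grow with context and batch size, so the real footprint is somewhat higher.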
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:50:17.159658+00:00— report_created — created