Report #87629
[tooling] llama.cpp crashes or runs out of VRAM with 32k+ context on 24GB GPU
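Quantize the KV cache to 4-bit or 8-bit using --cache-type-k q4_0 --cache-type-v q4_0 (or q8_0 for both). Relative to the default fp16 cache, this cuts KV memory by roughly 3.6x (q4_0) or 1.9x (q8_0), since each format stores a per-block scale alongside the quantized values. That can bring contexts up to 128k within reach of a consumer GPU for models whose weights leave enough headroom, at a minor perplexity cost. In many llama.cpp builds a quantized V cache also requires flash attention to be enabled (-fa / --flash-attn); check --help for your build.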
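The savings can be estimated from the block layout of the two formats. The sketch below assumes the usual ggml layout of 32 elements per block plus one fp16 scale (18 bytes for q4_0, 34 bytes for q8_0); the constant and function names are illustrative and not part of llama.cpp.

```python
# Rough per-element cost of q4_0 / q8_0 versus fp16.
# Assumption: 32-element blocks, each carrying one 2-byte fp16 scale.
BLOCK_ELEMS = 32
FP16_BYTES = 2

Q4_0_BLOCK_BYTES = 2 + BLOCK_ELEMS // 2  # scale + 32 packed 4-bit values
Q8_0_BLOCK_BYTES = 2 + BLOCK_ELEMS       # scale + 32 8-bit values


def bits_per_element(block_bytes: int) -> float:
    """Effective storage cost per cached element, in bits."""
    return block_bytes * 8 / BLOCK_ELEMS


def savings_vs_fp16(block_bytes: int) -> float:
    """How many times smaller the cache is than an fp16 cache."""
    return (BLOCK_ELEMS * FP16_BYTES) / block_bytes


for name, size in (("q4_0", Q4_0_BLOCK_BYTES), ("q8_0", Q8_0_BLOCK_BYTES)):
    print(f"{name}: {bits_per_element(size):.2f} bits/elem, "
          f"{savings_vs_fp16(size):.2f}x smaller than fp16")
# q4_0: 4.50 bits/elem, 3.56x smaller than fp16
# q8_0: 8.50 bits/elem, 1.88x smaller than fp16
```

Mixing types (for example q8_0 for K and q4_0 for V) lands between these two figures.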
Journey Context:
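The default KV cache is fp16, which costs 2 bytes per element. Each token stores one K and one V vector per layer, so the cache takes 2 x n_layers x n_kv_heads x head_dim x 2 bytes per token. For a 70B-class model with 80 layers and 128k context, that is about 40GB with grouped-query attention (assuming 8 KV heads of dimension 128) and far more than 100GB with full multi-head attention (64 KV heads). Many users incorrectly assume they must reduce context length or batch size. Quantizing the KV cache to the q4_0 or q8_0 formats is supported in llama.cpp via the --cache-type-k and --cache-type-v flags (added in late 2023). Tradeoff: slight accuracy degradation (usually <1% perplexity increase), in exchange for practical long-context inference on single-GPU setups.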
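A quick way to check whether a given context length fits is to compute the cache size directly. This is a minimal sketch; the model shape (80 layers, head dimension 128, 8 or 64 KV heads) is an assumed 70B-style configuration, and kv_cache_bytes is a hypothetical helper, not a llama.cpp function.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """Total KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


n_ctx = 128 * 1024  # 128k tokens

# Assumed 70B-style shape with grouped-query attention (8 KV heads).
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=n_ctx)
# Same shape with full multi-head attention (64 KV heads).
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, n_ctx=n_ctx)

print(f"fp16, GQA: {gqa / GIB:.1f} GiB")  # fp16, GQA: 40.0 GiB
print(f"fp16, MHA: {mha / GIB:.1f} GiB")  # fp16, MHA: 320.0 GiB

# q4_0 costs about 4.5 bits (0.5625 bytes) per element, scale included.
q4 = kv_cache_bytes(80, 8, 128, n_ctx, bytes_per_elem=4.5 / 8)
print(f"q4_0, GQA: {q4 / GIB:.2f} GiB")   # q4_0, GQA: 11.25 GiB
```

The real layer count, KV head count, and head dimension for a model are in its GGUF metadata, and the model weights still need their own share of VRAM on top of this figure.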
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:40:23.510781+00:00— report_created — created