Report #8552
[tooling] llama.cpp running out of VRAM with large context windows despite model fitting in GPU memory
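Add --cache-type-k q4_0 --cache-type-v q4_0 to quantize the KV cache to 4-bit. Q4_0 stores about 4.5 bits per element (including the per-block scale) versus 16 bits for FP16, so the cache for the context window shrinks by roughly 72%, with minimal perplexity impact. In many llama.cpp builds a quantized V cache also requires flash attention (-fa); check your build's --help output.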
Journey Context:
Many users fit the model weights in VRAM but crash when increasing context length. The cause is the KV cache (the key/value tensors kept for attention), which grows linearly with sequence length and is stored in FP16 by default. Quantizing the cache to Q4_0 trades a small amount of model accuracy (usually <1% perplexity increase) for the ability to run roughly 3.5x longer contexts in the same cache budget. Alternatives like --mlock or splitting across GPUs are slower or more complex.
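To make the linear growth concrete, here is a rough sizing sketch in Python. It assumes the cache holds one K and one V vector per layer, per token, per KV head, and uses the ggml block layouts (32 elements plus one FP16 scale per block for q8_0 and q4_0). The function name and the model shape are hypothetical and chosen for illustration; they are not taken from this report, and the estimate ignores compute buffers and other allocations.

```python
# Storage layout per ggml cache type: (elements per block, bytes per block).
# f16 is unblocked; q8_0 and q4_0 pack 32 elements plus one fp16 scale.
BLOCK_LAYOUT = {
    "f16": (1, 2),
    "q8_0": (32, 34),
    "q4_0": (32, 18),
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate the combined K and V cache size in bytes for one sequence."""
    block_elems, block_bytes = BLOCK_LAYOUT[cache_type]
    # One K vector and one V vector per layer, per token, per KV head.
    elements = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return elements * block_bytes // block_elems


if __name__ == "__main__":
    # Hypothetical model shape for illustration only (assumption).
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    for cache_type in BLOCK_LAYOUT:
        gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
        print(f"{cache_type}: {gib:.3f} GiB")
```

For this assumed shape at a 32768-token context, the estimate is 4 GiB for f16 and 1.125 GiB for q4_0. Doubling n_ctx doubles both figures, which is why a model that loads fine can still run out of VRAM once the context is raised.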
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:46:52.972648+00:00— report_created — created