Report #14173
[tooling] Context length limited by VRAM on long-context models with llama.cpp
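Enable KV cache quantization with --cache-type-k q8_0 and --cache-type-v q8_0 (or q4_0 for both) to cut KV cache memory by roughly 47% (q8_0) to 72% (q4_0) relative to the default f16 cache. That fits nearly 2x to about 3.5x more context in the same cache budget, usually with little perplexity impact. Setting only --cache-type-k shrinks the keys alone, so the saving is about half as large. In llama.cpp builds to date, a quantized V cache also needs flash attention enabled; check --help on your build.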
Journey Context:
Users often hit OOM when increasing context length because the KV cache grows linearly with context. Instead of buying more VRAM or switching to a smaller model, quantizing the KV cache (keys and values) to Q8_0 or even Q4_0 sharply reduces memory pressure. The tradeoff is some quality degradation in long-context coherence, more noticeable at Q4_0 than at Q8_0, but for RAG and retrieval tasks it is usually imperceptible. Many users don't know these flags exist or confuse them with weight quantization, which is a separate setting: the cache type is chosen at launch and is independent of how the model weights in the GGUF file are quantized.
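To make the linear scaling concrete, here is a rough sketch of the KV cache size arithmetic. It is an estimate, not llama.cpp's own accounting: the function name and the example model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical, and the bits-per-element figures assume ggml's block layouts for q8_0 and q4_0 (32 values plus one fp16 scale per block). Real usage also includes compute buffers and the weights themselves.

```python
# Effective bits per cached element, including the per-block fp16 scale.
# q8_0: 32 int8 values + 2-byte scale = 34 bytes per 32 elements -> 8.5 bits
# q4_0: 16 bytes of packed 4-bit values + 2-byte scale = 18 bytes -> 4.5 bits
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes (hypothetical helper, not a llama.cpp API)."""
    # Keys and values each store one vector per layer, KV head, and token.
    elements_per_side = n_layers * n_kv_heads * head_dim * n_ctx
    total_bits = elements_per_side * (
        BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v]
    )
    return total_bits / 8


if __name__ == "__main__":
    # Assumed example shape; substitute the values from your model's metadata.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    gib = 1024 ** 3
    for k, v in [("f16", "f16"), ("q8_0", "f16"), ("q8_0", "q8_0"), ("q4_0", "q4_0")]:
        size = kv_cache_bytes(**shape, type_k=k, type_v=v)
        print(f"K={k:5s} V={v:5s} {size / gib:.3f} GiB")
```

Under these assumptions, doubling n_ctx doubles every figure, and quantizing only the keys recovers roughly half of what quantizing both keys and values does.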
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:49:14.968491+00:00— report_created — created