Report #609
[tooling] How do I fit a long context window without OOM in llama-server?
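Quantize the KV cache with `--cache-type-k q8_0 --cache-type-v q8_0` (or `q4_0` for aggressive cases). Relative to the default f16 cache, this cuts KV memory by roughly half with `q8_0` and by roughly 70% with `q4_0`, usually with minimal quality loss at `q8_0`. Enable `--flash-attn` as well (llama.cpp requires flash attention for a quantized V cache) and set `--ctx-size` to the context length you actually need. No model requantization is needed.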
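As a rough illustration of the flags above, here is a minimal Python sketch that assembles a `llama-server` command line. The helper name `build_llama_server_args`, the model path `models/example.gguf`, and the context size are hypothetical placeholders, not part of the original report. Some newer llama.cpp builds expect a value after `--flash-attn` (for example `on`), so the sketch leaves that optional; check `llama-server --help` for your build.

```python
import shlex


def build_llama_server_args(model_path, ctx_size, k_type="q8_0", v_type="q8_0",
                            flash_attn_value=None):
    """Assemble a llama-server argument list with a quantized KV cache.

    flash_attn_value is an assumption: older builds take a bare --flash-attn,
    while some newer builds expect a value such as "on". Pass None for the
    bare flag.
    """
    args = [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--cache-type-k", k_type,
        "--cache-type-v", v_type,
        "--flash-attn",
    ]
    if flash_attn_value is not None:
        args.append(flash_attn_value)
    return args


# Hypothetical model path and context size, for illustration only.
example_args = build_llama_server_args("models/example.gguf", 32768)
print(shlex.join(example_args))
```

The printed string can be pasted into a shell, or the list can be passed directly to `subprocess.run` once the path and context size are replaced with real values.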
Journey Context:
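For long contexts, the KV cache can exceed model weight memory, especially for models without GQA, because cache size grows linearly with both context length and the number of KV heads. llama.cpp lets you set the K and V cache types independently at runtime. `q8_0` is usually safe for both. When VRAM is tight, dropping one of them to `q4_0` can work, but check output quality: K is commonly reported to be more sensitive to quantization than V, so lowering V first is often the safer choice. Common mistakes: confusing KV-cache quantization with weight quantization, or not realizing the KV type can be changed without downloading a new GGUF. With the flash-attention backend, the overhead of cache quantization is negligible.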
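To see where the savings come from, here is a back-of-the-envelope estimate of KV cache size. It is a sketch under stated assumptions: the model shape (32 layers, 128-dimensional heads, 8 KV heads with GQA versus 32 without) is an illustrative example rather than a specific model from this report, and the helper `kv_cache_bytes` is hypothetical. It ignores padding and other runtime buffers, so real usage will differ somewhat. The per-element sizes follow ggml's block layouts, where `q8_0` and `q4_0` pack 32 values plus one fp16 scale into 34 and 18 bytes respectively.

```python
# Bytes per cached element. q8_0 and q4_0 store 32 elements per block
# together with one fp16 scale: 34 and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size,
                   k_type="f16", v_type="f16"):
    """Approximate KV cache size in bytes (ignores padding and other buffers)."""
    # Number of elements in K alone; V has the same count.
    elements = n_layers * n_kv_heads * head_dim * ctx_size
    per_element = BYTES_PER_ELEMENT[k_type] + BYTES_PER_ELEMENT[v_type]
    return int(elements * per_element)


GIB = 2 ** 30

# Illustrative shape (an assumption, not taken from the report).
shape = dict(n_layers=32, head_dim=128, ctx_size=32768)

for label, n_kv_heads in [("GQA, 8 KV heads", 8), ("no GQA, 32 KV heads", 32)]:
    for cache_type in ["f16", "q8_0", "q4_0"]:
        size = kv_cache_bytes(n_kv_heads=n_kv_heads, k_type=cache_type,
                              v_type=cache_type, **shape)
        print(f"{label:20s} {cache_type:5s} {size / GIB:6.2f} GiB")
```

For this shape the GQA case comes to about 4 GiB at f16, a little over 2 GiB at `q8_0`, and a little over 1 GiB at `q4_0`; the non-GQA case is four times larger at every setting, which is why long contexts hit such models hardest.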
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T10:52:30.062511+00:00— report_created — created