Report #1854
[tooling] llama-server runs out of memory or cannot fit a long context on the same GPU
Add --cache-type-k q8_0 --cache-type-v q8_0 to llama-server or llama-cli (use q4_0 for maximum compression). This quantizes the KV cache and roughly halves KV-cache memory versus f16 with minimal quality loss, letting you run larger contexts without more RAM/VRAM. Combine with --flash-attn, which llama.cpp has required for a quantized V cache. Example: llama-server -m model.gguf -c 32768 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn
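To see where the saving comes from, here is a rough back-of-the-envelope estimate of KV-cache size per cache type. It is a sketch, not llama.cpp's own accounting: the function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical, and it ignores weights, compute buffers, and any padding. The bytes-per-element figures assume the usual ggml block layout, where q8_0 and q4_0 store 32 values plus one f16 scale per block (34 and 18 bytes).

```python
# Approximate bytes per cached element for each cache type.
# f16 is 2 bytes; q8_0 and q4_0 pack 32 values per block with one f16 scale,
# giving 34 and 18 bytes per block respectively (assumed ggml layout).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Estimate KV-cache size: one K and one V vector per layer per token."""
    elems_per_token = n_layers * n_kv_heads * head_dim
    k_bytes = elems_per_token * n_ctx * BYTES_PER_ELEMENT[type_k]
    v_bytes = elems_per_token * n_ctx * BYTES_PER_ELEMENT[type_v]
    return int(k_bytes + v_bytes)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(**shape, type_k=cache_type, type_v=cache_type)
        print(f"{cache_type}: {size / 2**30:.3f} GiB")
```

For this hypothetical shape at a 32768-token context, the estimate is 4 GiB at f16, about 2.1 GiB at q8_0, and about 1.1 GiB at q4_0. Substitute your model's real layer count, KV-head count, and head dimension (shown in the llama.cpp load log) to get a figure for your own setup.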
Journey Context:
Most users quantize model weights (GGUF) but leave the KV cache at f16, even though KV memory grows linearly with context length and can dominate at long context. q8_0 is the safe default; q4_0 saves more, but accumulated quantization noise can hurt coherence past 64K tokens. Some older builds had quantized-KV bugs, so use a recent llama.cpp and sanity-check output on your model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:50:54.291927+00:00 - report_created - created