Report #36321
[tooling] llama.cpp OOM or slow context shifts with long contexts despite using --flash-attn
Add -ctk q8_0 -ctv q8_0 to quantize the KV cache to 8-bit when using --flash-attn. This cuts KV cache memory by roughly half relative to fp16 with minimal perplexity impact, which can make 128k+ context feasible on 24GB VRAM, depending on model size and weight quantization.
Journey Context:
Most users enable --flash-attn but keep the KV cache in fp16/fp32, which dominates memory at long context. For fp16 the size is 2 (K and V) * 2 bytes * n_layers * n_kv_heads * head_dim * seq_len, where n_kv_heads is the number of key/value heads; with grouped-query attention this is smaller than the number of attention heads. The -ctk (cache type key) and -ctv (cache type value) flags are underused because they're not in the main --help banner; they require knowing that llama.cpp supports Q8_0 and Q4_0 KV quantization. Tradeoff: Q4_0 saves more memory but can degrade long-context retrieval accuracy; Q8_0 is the sweet spot. This is distinct from model weight quantization (Q4_K_M, etc.).
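To make the memory arithmetic concrete, here is a minimal Python sketch that estimates KV cache size per cache type. It assumes the standard ggml block layouts (32 elements per block, with one 2-byte fp16 scale per block for Q8_0 and Q4_0). The function name kv_cache_bytes and the model dimensions are hypothetical, chosen only for illustration; real usage in llama.cpp may differ slightly because of padding and other buffers.

```python
# Bytes needed to store one block of 32 elements for each cache type.
# f16:  32 * 2 bytes
# q8_0: 2-byte fp16 scale + 32 int8 values
# q4_0: 2-byte fp16 scale + 32 4-bit values (16 bytes)
BYTES_PER_BLOCK = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, cache_type="f16"):
    """Estimate KV cache size in bytes (K and V use the same cache type here)."""
    # Factor 2 covers the K tensor and the V tensor.
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * BYTES_PER_BLOCK[cache_type] // BLOCK_SIZE


# Hypothetical model dimensions (assumption, not taken from the report):
# 32 layers, 8 KV heads, head_dim 128, 128k-token context.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(32, 8, 128, 131072, cache_type)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

For these assumed dimensions the estimate is 16.00 GiB for f16, 8.50 GiB for q8_0, and 4.50 GiB for q4_0, so Q8_0 saves a little under half because of the per-block scale.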
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:26:25.120867+00:00— report_created — created