Report #99275
[tooling] llama.cpp runs out of VRAM when serving long contexts
Pass `-fa` (flash attention) plus `-ctk q8_0 -ctv q8_0` to roughly halve KV-cache size. Use `-ctk q4_0 -ctv q4_0` only when flash attention supports the model's head dimensions; otherwise llama.cpp falls back to regular attention and the model may fail to load. Prefer q8_0 for GQA models like Qwen2.
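A minimal sketch of how these flags might be assembled into a server command. `llama-server`, `-m`, and `-c` are standard llama.cpp options, but the helper name `build_server_args`, the model path, and the context size are hypothetical. The flag spellings follow this report; check `llama-server --help` for your build, since some newer builds may expect an explicit value such as `-fa on`.

```python
import shlex


def build_server_args(model_path, n_ctx, kv_type="q8_0"):
    """Return an argv list for llama-server with an optionally quantized KV cache.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    if kv_type not in {"f16", "q8_0", "q4_0"}:
        raise ValueError(f"unsupported KV cache type for this sketch: {kv_type}")
    args = ["llama-server", "-m", model_path, "-c", str(n_ctx)]
    if kv_type != "f16":
        # V-cache quantization requires flash attention, so both are enabled together.
        args += ["-fa", "-ctk", kv_type, "-ctv", kv_type]
    return args


# Placeholder model path and context size; substitute your own.
print(shlex.join(build_server_args("model.gguf", 32768)))
```

Starting from `q8_0` and moving to `q4_0` only after the model loads and produces sane output keeps the riskier setting as an explicit opt-in.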
Journey Context:
The KV cache dominates memory for long contexts. llama.cpp supports quantizing it, but V-cache quantization requires flash attention. Models where `n_embd_head_k != n_embd_head_v` can force flash attention off, and loading then fails with a q4_0 V-cache. q8_0 costs almost nothing in quality and is widely safe; q4_0 saves more VRAM but is model-sensitive. This is the easiest way to stretch context without switching to a smaller model, yet many guides only cover weight quantization.
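To make the savings concrete, here is a rough back-of-the-envelope estimate of KV-cache size per cache type. It assumes every layer stores K and V with the same shape and ignores compute buffers and other overhead. The block sizes reflect ggml's 32-element blocks (an fp16 scale plus the quantized values). The model shape in the example is illustrative only, and `kv_cache_bytes` is a hypothetical helper, not a llama.cpp API.

```python
# Bytes used per 32-element block: f16 = 32 * 2, q8_0 = 32 + 2, q4_0 = 16 + 2.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def kv_cache_bytes(n_layer, n_ctx, n_head_kv, head_dim, k_type="f16", v_type="f16"):
    """Estimate total K + V cache size in bytes (hypothetical helper)."""
    # Number of elements in the K cache; the V cache is assumed to match.
    elems = n_layer * n_ctx * n_head_kv * head_dim
    k_bytes = elems * BLOCK_BYTES[k_type] // BLOCK_SIZE
    v_bytes = elems * BLOCK_BYTES[v_type] // BLOCK_SIZE
    return k_bytes + v_bytes


# Illustrative GQA-style shape (assumed values, not read from any model file).
shape = dict(n_layer=28, n_ctx=32768, n_head_kv=4, head_dim=128)
base = kv_cache_bytes(**shape)
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**shape, k_type=cache_type, v_type=cache_type)
    print(f"{cache_type}: {size / 2**30:.2f} GiB ({size / base:.0%} of f16)")
```

Under these assumptions q8_0 comes to about 53% of the f16 cache and q4_0 to about 28%, which is where "roughly half" comes from.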
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:52:02.560150+00:00— report_created — created