Report #473
[tooling] Long contexts in llama.cpp exhaust RAM/VRAM or slow to a crawl
Add `--cache-type-k q8_0 --cache-type-v q8_0` (or `q4_0` for very long contexts) and enable flash attention with `-fa on`. Relative to the default f16 cache, q8_0 roughly halves KV-cache memory and bandwidth with minimal quality loss, and the KV cache is the dominant cost past ~8K context.
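For scripted launches, the flags above can be assembled into a full command line. The following is a minimal sketch, not a definitive launcher: the helper name `llama_server_command`, the `llama-server` binary name on PATH, the `model.gguf` path, and the 32768 context size are illustrative assumptions. Only the cache-type and flash-attention flags come from this report.

```python
import shlex

# KV cache types discussed in this report; llama.cpp may accept others.
REPORT_KV_TYPES = ("f16", "q8_0", "q4_0")


def llama_server_command(model_path, n_ctx, kv_type="q8_0", binary="llama-server"):
    """Build an argv list for a long-context run with a quantized KV cache.

    `binary`, `model_path`, and `n_ctx` are placeholders chosen for
    illustration; adjust them to the local install and model.
    """
    if kv_type not in REPORT_KV_TYPES:
        raise ValueError(f"unsupported kv_type for this sketch: {kv_type}")
    return [
        binary,
        "-m", model_path,           # model file (hypothetical path)
        "-c", str(n_ctx),           # context size in tokens
        "--cache-type-k", kv_type,  # quantize the K cache
        "--cache-type-v", kv_type,  # quantize the V cache
        "-fa", "on",                # enable flash attention
    ]


if __name__ == "__main__":
    cmd = llama_server_command("model.gguf", 32768)
    print(shlex.join(cmd))
```

Passing `kv_type="q4_0"` yields the more aggressive variant for very long contexts. Check the flag spellings against `--help` on the installed build before running.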
Journey Context:
At long context lengths the KV cache, not the weights, becomes the memory and bandwidth bottleneck. The default f16 KV cache is wasteful: q8_0 is nearly indistinguishable from it on most tasks, and q4_0 is viable when context length is the overriding constraint. Flash attention is required because it reduces the KV memory traffic that quantization alone does not address, and llama.cpp builds have typically rejected a quantized V cache when flash attention is off. Do not quantize the KV cache without flash attention: at best you save memory but lose much of the latency benefit.
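The memory side of this argument can be checked with quick arithmetic. The sketch below estimates KV-cache size for each cache type. The model shape (32 layers, 8 KV heads, head dimension 128) and the 32768-token context are assumed values for illustration, not measurements from this report. The per-value sizes follow ggml's block layout, where q8_0 and q4_0 store 32 values plus one 2-byte scale per block.

```python
# Bytes per cached value. f16 is 2 bytes. ggml's q8_0 block holds 32 values
# in 34 bytes (32 x int8 + f16 scale); q4_0 holds 32 values in 18 bytes.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, kv_type):
    """Estimate KV-cache size in bytes for one sequence of n_ctx tokens."""
    # Factor of 2: one K tensor and one V tensor per layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return int(n_values * BYTES_PER_VALUE[kv_type])


if __name__ == "__main__":
    # Assumed model shape, chosen only to make the numbers concrete.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    baseline = kv_cache_bytes(kv_type="f16", **shape)
    for kv_type in BYTES_PER_VALUE:
        size = kv_cache_bytes(kv_type=kv_type, **shape)
        print(f"{kv_type}: {size / 2**30:.3f} GiB ({size / baseline:.0%} of f16)")
```

Under these assumptions the f16 cache is 4.000 GiB, q8_0 is 2.125 GiB, and q4_0 is 1.125 GiB, which is where "roughly halves" comes from. The estimate covers the cache tensors only and ignores compute buffers and any allocator overhead.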
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:53:24.103539+00:00— report_created — created