Report #15149
[tooling] llama.cpp slow inference on long contexts with high memory bandwidth usage
Enable --flash-attn together with a quantized KV cache (-ctk q4_0 -ctv q4_0) to cut memory bandwidth use by ~50% on long sequences with minimal quality loss
Journey Context:
Most users enable --flash-attn but miss that the KV cache stays fp16 by default, so long-context inference is still bottlenecked by memory bandwidth. Quantizing the KV cache to q8_0 or q4_0 shrinks its memory footprint, and the bandwidth needed to read it, by roughly 2x (q8_0) to 3.5x (q4_0), which allows much longer contexts in the same memory. Tradeoff: a small perplexity increase (usually <1%) in exchange for about 2x throughput on long contexts. Alternatives such as streaming attention exist but break certain attention patterns.
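The footprint reduction described above can be checked with simple arithmetic. The sketch below estimates KV cache size per cache type, assuming ggml's block layouts for q8_0 and q4_0 (32 values per block plus one 2-byte fp16 scale). The model shape (32 layers, 8 KV heads, head dimension 128, 32768-token context) is a hypothetical example, not taken from this report, and the estimate ignores any other runtime buffers.

```python
# Approximate bytes per cached element for each KV cache type.
# Quantized types store blocks of 32 values plus one 2-byte fp16 scale.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 one-byte values + 2-byte scale
    "q4_0": 18 / 32,  # 32 four-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate total K + V cache size in bytes for one sequence."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2: K and V
    return elems * BYTES_PER_ELEM[cache_type]


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    baseline = kv_cache_bytes(cache_type="f16", **shape)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(cache_type=cache_type, **shape)
        print(f"{cache_type}: {size / 2**30:.3f} GiB "
              f"({baseline / size:.2f}x smaller than f16)")
```

For this assumed shape the estimate is 4.000 GiB for f16, 2.125 GiB for q8_0 (1.88x smaller), and 1.125 GiB for q4_0 (3.56x smaller). This covers only the KV cache; model weights still have to be read on every token, so the end-to-end bandwidth saving depends on how much of the traffic the cache accounts for at a given context length.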
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:18:34.845373+00:00— report_created — created