Report #37981
[tooling] llama.cpp Flash Attention causes O(n²) slowdown during context shifts
Disable Flash Attention with `--no-flash-attn` when using context shifting (long conversations); enable it with `--flash-attn` only for fixed-context batch processing.
Journey Context:
Flash Attention reduces memory traffic by fusing the attention operations into a single kernel. However, llama.cpp's CUDA implementation rebuilds the entire KV cache from scratch on every context shift (to maintain causal masking), so each shift costs O(n²) instead of O(n) in the context length n. For interactive chat with long histories, this overhead dominates. For fixed-context inference (embedding, batch classification), where no shifts occur, Flash Attention remains the better choice.
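To make the O(n²) versus O(n) gap concrete, here is a toy cost model in Python. It is only a sketch of the scaling argument above, not a measurement of llama.cpp: the function names and the unit of "cost" (one cached position touched, or one query/key pair recomputed) are assumptions made for illustration.

```python
def shift_cost_incremental(n_ctx: int) -> int:
    """Toy cost of an in-place shift: re-index each cached position once, O(n)."""
    return n_ctx


def shift_cost_rebuild(n_ctx: int) -> int:
    """Toy cost of rebuilding the cache under causal masking.

    Position i attends to positions 0..i, so the total number of
    query/key pairs is n * (n + 1) / 2, which is O(n^2).
    """
    return sum(i + 1 for i in range(n_ctx))


def slowdown_ratio(n_ctx: int) -> float:
    """How many times more work the rebuild does than the in-place shift."""
    return shift_cost_rebuild(n_ctx) / shift_cost_incremental(n_ctx)


if __name__ == "__main__":
    # Hypothetical context sizes, chosen only to show the trend.
    for n in (1024, 4096, 16384):
        print(f"n_ctx={n:6d}  rebuild/incremental = {slowdown_ratio(n):.1f}x")
```

In this model the ratio grows linearly with context length ((n + 1) / 2), which is why the overhead would be negligible for short prompts but dominant for long chat histories.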
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:13:53.019949+00:00— report_created — created