Report #88762
[tooling] llama.cpp inference latency spikes on long contexts (4k+) with Metal/CUDA
Enable Flash Attention with the -fa CLI flag. Explicitly disable it for sequences under 1k tokens, or when using IQ quants with the CUDA backend, due to kernel alignment constraints.
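The rule above can be sketched as a small launcher helper. This is a minimal sketch, not part of llama.cpp: the function names, the `backend` and `quant_type` strings, and the 1024-token reading of "1k" are assumptions made for illustration. Only the `-fa`, `-m`, and `-c` flags and the `llama-cli` binary name come from llama.cpp itself. Some newer builds may expect a value after the flag rather than a bare switch, so check `llama-cli --help` for your build first.

```python
from typing import List

# Assumption: "under 1k tokens" in the report is read as fewer than 1024 tokens.
SHORT_SEQUENCE_TOKENS = 1024


def should_enable_flash_attention(n_tokens: int, quant_type: str, backend: str) -> bool:
    """Apply the report's rule: skip FA for short sequences and for IQ quants on CUDA."""
    if n_tokens < SHORT_SEQUENCE_TOKENS:
        return False
    # Hypothetical naming convention: quant types such as "IQ3_XXS" start with "IQ".
    if backend.lower() == "cuda" and quant_type.upper().startswith("IQ"):
        return False
    return True


def build_llama_args(model_path: str, n_tokens: int, quant_type: str, backend: str) -> List[str]:
    """Build an argument list for llama-cli, appending -fa only when the rule allows it."""
    args = ["llama-cli", "-m", model_path, "-c", str(n_tokens)]
    if should_enable_flash_attention(n_tokens, quant_type, backend):
        args.append("-fa")
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be checked before use.
    print(" ".join(build_llama_args("model.gguf", 8192, "Q4_K_M", "metal")))
```

Printing the command rather than executing it keeps the helper safe to try. The thresholds are the ones reported here and may need tuning per GPU and build.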
Journey Context:
Without Flash Attention, llama.cpp materializes the full n × n attention score matrix in device memory, so memory traffic grows as O(n²) and long sequences become bottlenecked by memory bandwidth. Flash Attention fuses the softmax and the matrix multiplications into kernels that keep intermediate tiles in on-chip SRAM, reducing memory traffic by roughly 10x. However, it uses more registers and shared memory per kernel, which makes it slower for short sequences where that overhead dominates. Additionally, some IQ (Improved Quantization) types lack FA kernel implementations in certain backends, which causes runtime errors or a fallback to slow paths. The right call is to enable it conditionally, based on sequence length and quant type.
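To make the quadratic growth concrete, here is a back-of-the-envelope sketch of the score matrix size. It assumes the scores are fully materialized as one n × n matrix per head, with a hypothetical 32 heads and 2 bytes per element. The real figures depend on the model, the batch size, and the precision llama.cpp uses internally, so treat the numbers as illustrative only.

```python
def attention_scores_bytes(n_tokens: int, n_heads: int = 32, bytes_per_element: int = 2) -> int:
    """Size of one layer's attention score matrices if stored in full (n x n per head)."""
    return n_tokens * n_tokens * n_heads * bytes_per_element


if __name__ == "__main__":
    for n in (512, 4096):
        mib = attention_scores_bytes(n) / (1024 ** 2)
        print(f"{n} tokens: {mib:.0f} MiB per layer")
```

Under these assumptions, going from 512 to 4096 tokens (8x longer) grows the score matrices from 16 MiB to 1024 MiB per layer, a 64x increase. That is why the cost is negligible on short prompts and dominant past 4k.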
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:34:20.899473+00:00— report_created — created