Report #92276
[tooling] llama.cpp inference latency spikes and throughput collapses with context windows >8k tokens
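Compile with LLAMA_FLASH_ATTN=ON (or use a prebuilt binary with Flash Attention support; build option names have changed across llama.cpp versions, so check the build docs for yours) and start the server with --flash-attn. This reduces memory bandwidth pressure and can give a 2-4x speedup on long sequences.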
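As a rough illustration of the invocation described above, the following sketch assembles the server command line in Python. The binary name llama-server, the model path, and the context size are assumptions chosen for the example; -m and -c are the usual model and context-size options. Some builds may expect a value after --flash-attn, so check the output of --help for your version before relying on this.

```python
import shlex


def build_server_command(model_path, ctx_size, flash_attn=True, binary="llama-server"):
    """Assemble an argv list for a llama.cpp server launch (illustrative helper)."""
    # -m selects the GGUF model file, -c sets the context window in tokens.
    cmd = [binary, "-m", model_path, "-c", str(ctx_size)]
    if flash_attn:
        # Flag taken from the report; newer builds may want an explicit value after it.
        cmd.append("--flash-attn")
    return cmd


# Hypothetical model path and a 16k-token context, as in the long-context case.
command = build_server_command("models/model.gguf", 16384)

# Print a shell-quoted form for inspection; nothing is launched here.
print(shlex.join(command))
```

The list can be handed to subprocess.run once the printed command has been checked by hand.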
Journey Context:
Standard attention is memory-bound on long contexts: it materializes an n×n score matrix, so memory traffic grows as O(n²) and saturates DDR or unified-memory bandwidth. Flash Attention tiles the computation so that each block of scores stays in on-chip SRAM or registers, sharply reducing accesses to main memory (HBM, DDR, or unified memory). Critical detail: the benefit only materializes when the sequence is long enough (>2048 tokens) to amortize the kernel launch overhead; on short prompts it can slightly regress latency. The model must also be in GGUF format, which is the standard for llama.cpp. Common mistake: enabling Flash Attention on systems with extremely limited VRAM (<4 GB), where layers still end up offloaded to the CPU and the benefit is negated.
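The tiling idea can be sketched in a few lines of plain Python. This is a minimal illustration for a single query vector, not llama.cpp's actual kernel: the function names, block size, and random test data are all assumptions made for the example. It shows how a running maximum and running sum (the "online softmax" trick) let the keys and values be consumed one block at a time without ever holding the full row of scores.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, block_size=4):
    """Tiled version: visit keys/values block by block, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (factor is 0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Small deterministic example: 10 keys with block size 4 exercises a ragged last block.
rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
values = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]

naive_out = naive_attention(q, keys, values)
tiled_out = tiled_attention(q, keys, values, block_size=4)
max_diff = max(abs(a - b) for a, b in zip(naive_out, tiled_out))
print(max_diff)
```

Both functions produce the same result up to floating-point rounding; the tiled one only ever holds block_size scores at a time, which is the property that lets a real kernel keep its working set on-chip.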
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:28:44.246116+00:00— report_created — created