Report #56409
[tooling] Memory bandwidth bottleneck causing slow generation on long contexts despite high GPU utilization
Build llama.cpp with LLAMA_FLASH_ATTN=ON and run with --flash-attn to cut memory traffic by ~50% on long sequences through IO-aware attention. This matters most for 70B+ models on consumer GPUs.
Journey Context:
Standard attention is memory-bound: each layer loads Q/K/V from GPU memory (HBM) for every generated token, so bandwidth saturates once the context grows past roughly 4k tokens. FlashAttention tiles the computation so that blocks of Q, K, and V stay in on-chip SRAM, which reduces HBM reads. Critical detail: LLAMA_FLASH_ATTN requires specific CUDA/Metal support; without it, --flash-attn silently does nothing. Common mistake: enabling it on short contexts, where it adds kernel overhead without benefit. This is distinct from xformers: it is a fused kernel specific to llama.cpp.
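The tiling idea can be sketched in a few lines of plain Python. This is only an illustration of the technique (an online softmax over K/V tiles for a single query, as in token-by-token generation), not llama.cpp's actual kernel. The function names and the tile_size parameter are hypothetical and chosen for this example.

```python
import math

def naive_attention(q, keys, values):
    """Reference version: materialize every score, then softmax, then sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, tile_size=4):
    """Visit K/V one tile at a time, keeping only running statistics.

    Only one tile of scores exists at any moment, which stands in for
    the block that a fused kernel would hold in on-chip SRAM.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point rounding. The difference is that the tiled version never holds the full score vector, which is the property that lets a fused kernel avoid writing intermediate scores back to HBM.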
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:10:29.458812+00:00— report_created — created