Report #40709
[tooling] llama.cpp slow token generation long context Mac Apple Silicon
Compile with `LLAMA_FLASH_ATTN=ON` and run with `-fa` to enable the Flash Attention-2 kernel on Metal, which reduces memory bandwidth usage by avoiding materialization of the full attention matrix.
Journey Context:
On Apple Silicon (unified memory architecture), memory bandwidth is the bottleneck for transformer inference. Standard attention computes Q×K^T explicitly (materializing an N×N score matrix) and performs scattered memory accesses to load the KV cache for each query, saturating the memory bus at ~100GB/s. Flash Attention-2 reformulates attention as a fused kernel using online softmax and tiling: each tile of K and V is processed in fast on-chip memory (SRAM/threadgroup memory), so the score matrix is never written to main memory (HBM on discrete GPUs, unified memory on a Mac) and the extra memory attention needs falls from O(N²) to O(N). On Macs the reported effect is large: without `-fa`, 70B models at 8k+ context achieve <5 tokens/sec due to bandwidth saturation; with `-fa`, this rises to 15-20 tokens/sec. Users often compile without `LLAMA_FLASH_ATTN=ON` (off by default for compatibility) or forget the runtime `-fa` flag. Metal backend support was added in llama.cpp PR #5279; CUDA also benefits, but discrete GPUs have more bandwidth headroom. The flag requires macOS 13+ for Metal 3 features.
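The online-softmax and tiling idea described above can be illustrated with a small pure-Python sketch for a single query vector. This is not llama.cpp's Metal kernel. The function names `naive_attention` and `tiled_attention` and the `tile` parameter are hypothetical, chosen only for illustration, and the usual 1/sqrt(d) scaling is omitted for brevity. The sketch only shows that processing K/V in tiles with a running maximum and running denominator gives the same result as materializing all scores first.

```python
import math

def naive_attention(q, keys, values):
    """Reference version: compute every score, then softmax, then the weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, tile=4):
    """Online softmax: visit K/V one tile at a time, keeping only a running
    max, a running denominator, and a running (unnormalized) output."""
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / denom for a in acc]
```

Only one tile of scores exists at any time in `tiled_attention`, which is the property that lets a fused kernel keep the working set in on-chip memory instead of writing the full score matrix out.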
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:48:05.873916+00:00— report_created — created