Report #45891
[tooling] Slow inference on Apple Silicon with long context windows in llama.cpp
Add the --flash-attn (or -fa) flag when running the llama.cpp server or main binary (named llama-server and llama-cli in current builds). This enables Flash Attention for the Metal backend, reducing memory bandwidth pressure by ~50% on long contexts.
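As a minimal sketch of the fix, the helper below only assembles and prints a launch command so it can be inspected before anything runs. The function name build_server_cmd, the model path, and the default context size are hypothetical. The optional fa_value argument is an assumption: some newer llama.cpp builds appear to expect an explicit value after the flag (for example "on"), so check the output of --help for your build.

```python
import shlex


def build_server_cmd(model_path, ctx_size=8192, fa_value=None, binary="llama-server"):
    """Return an argv list that launches the server with Flash Attention requested."""
    cmd = [binary, "-m", model_path, "-c", str(ctx_size), "--flash-attn"]
    if fa_value is not None:
        # Assumption: some newer builds expect an explicit value such as "on".
        cmd.append(fa_value)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be checked first.
    print(shlex.join(build_server_cmd("models/model.gguf")))
```

The returned list can be passed to subprocess.run once the flags have been confirmed against the installed build.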
Journey Context:
Without Flash Attention, the attention mechanism becomes memory-bandwidth bound on Apple Silicon as context grows, causing token generation to slow to a crawl (e.g., <1 tok/sec at 8k+ context). Most users assume this is a fundamental limitation of the hardware. Flash Attention reformulates the attention computation to be IO-aware: it processes keys and values in tiles, keeps intermediate results in fast on-chip GPU memory, and avoids writing the full attention score matrix out to main memory and reading it back. The tradeoff is slightly higher transient memory usage during the attention operation, but the speedup on long contexts (2-5x) is dramatic. Flash Attention support was merged into llama.cpp in spring 2024, but it is often missed because tutorials focus on CUDA Flash Attention and don't mention the Metal implementation.
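To make the IO-aware idea concrete, here is a small pure-Python illustration of the tiling trick (often called online softmax) for a single query vector. It is not llama.cpp's Metal kernel; the function names and the block size are hypothetical and chosen only for illustration. The point is that the tiled version never holds more than one block of scores at a time, yet produces the same result as the version that materializes every score first.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, block=4):
    """Same result, but keys/values are consumed one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf  # largest score seen so far
    running_sum = 0.0        # softmax denominator, relative to running_max
    acc = [0.0] * dim        # unnormalized output, relative to running_max
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

In a real GPU kernel the per-block scores and accumulators would live in on-chip memory, which is where the bandwidth saving comes from; this sketch only demonstrates that the blockwise arithmetic is equivalent.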
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:30:13.432023+00:00— report_created — created