Report #47958
[tooling] llama.cpp on Apple Silicon is slower than expected for prompt processing despite using Metal
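Add the `-fa` (or `--flash-attn`) flag when running llama.cpp on macOS to enable the Metal Flash Attention kernels. This reduces memory bandwidth usage and, per this report, speeds up prompt ingestion by roughly 20-40% on Apple Silicon. Newer builds may expect an explicit value (for example `-fa on`), so check `--help` for your build.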
Journey Context:
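By default, llama.cpp on Metal uses standard attention, which is memory-bandwidth bound. Flash Attention fuses the attention operations so that intermediate results are not repeatedly read from and written to main memory (unified memory on Apple Silicon, which has no separate HBM or VRAM). According to this report, it is not enabled by default because it uses slightly more memory (5-10% overhead) and produces minor numerical differences (within 1e-5). Many users don't know the flag exists and so miss out on the 20-40% prompt-processing speedup.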
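To check whether the flag helps on a particular machine, the comparison can be scripted. Here is a minimal sketch, assuming the `llama-bench` tool from llama.cpp is on your `PATH` and that your build accepts `-fa 0,1` to test both settings in one run. The helper names `bench_command` and `speedup_percent` are hypothetical, and `model.gguf` is a placeholder path.

```python
import shlex


def bench_command(model_path, prompt_tokens=512, binary="llama-bench"):
    """Build a llama-bench invocation that measures prompt processing
    with Flash Attention off (0) and on (1).

    Assumption: the local llama-bench build supports -m, -p, -n and -fa.
    """
    return [
        binary,
        "-m", model_path,
        "-p", str(prompt_tokens),  # prompt-processing test size in tokens
        "-n", "0",                 # skip the text-generation test
        "-fa", "0,1",              # run once without and once with Flash Attention
    ]


def speedup_percent(baseline_tps, flash_tps):
    """Relative prompt-processing speedup, in percent, from two tokens/s readings."""
    return (flash_tps / baseline_tps - 1.0) * 100.0


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(bench_command("model.gguf")))
```

Read the two prompt-processing tokens-per-second figures from the benchmark output and pass them to `speedup_percent` to see where your hardware and model fall relative to the 20-40% range reported above.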
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:58:54.159020+00:00— report_created — created