Report #40120
[tooling] llama.cpp slow on Apple Silicon despite unified memory
Compile with LLAMA_METAL=ON and pass the --flash-attn flag at runtime to enable Flash Attention on Metal. This reduces memory bandwidth pressure, with a reported 20-40% throughput improvement on M-series chips.
Journey Context:
Apple Silicon offers high unified-memory bandwidth but less raw GPU compute than discrete NVIDIA cards. The standard attention implementation is memory-bandwidth bound on Metal because it runs separate kernels for the QK^T product, the softmax, and the multiplication by V, and the intermediate score matrix is written to and read back from memory between them. Flash Attention fuses these steps into fewer kernel launches with tiling, which significantly reduces memory traffic and improves throughput on M1/M2/M3 chips. Two things are required: the Metal build option (-DLLAMA_METAL=ON) at compile time and the --flash-attn flag at runtime. Without the runtime flag, the standard attention path is used and performance is suboptimal. Common error: forgetting --flash-attn at runtime even with a Metal build.
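The build and run steps described above can be sketched as two small shell helpers. This is a minimal sketch, not a verified procedure: the function names `build_llama_metal` and `run_with_flash_attn` are hypothetical, and the binary and model paths in the usage note are placeholders. The option name LLAMA_METAL is taken from this report; newer llama.cpp trees may name it GGML_METAL and may enable Metal by default on macOS, and some newer builds may expect a value after --flash-attn, so check `cmake -LH` and `--help` for your checkout first.

```shell
# Configure and build with the Metal backend enabled.
# Assumption: run from the root of a llama.cpp checkout; the option name
# LLAMA_METAL comes from this report and may differ in newer versions.
build_llama_metal() {
    cmake -B build -DLLAMA_METAL=ON &&
    cmake --build build --config Release
}

# Run a command, appending --flash-attn unless it (or the short form -fa)
# is already among the arguments. This guards against the common mistake
# of forgetting the runtime flag on a Metal build.
# Assumption: --flash-attn is a plain boolean switch, as in this report.
run_with_flash_attn() {
    bin="$1"
    shift
    case " $* " in
        *" --flash-attn "* | *" -fa "*) ;;   # flag already present
        *) set -- "$@" --flash-attn ;;       # append the missing flag
    esac
    "$bin" "$@"
}
```

A possible invocation, with placeholder paths, would be `build_llama_metal` followed by `run_with_flash_attn ./build/bin/llama-cli -m ./models/model.gguf -p "Hello"`. Wrapping the call keeps the flag from being dropped when commands are copied between scripts.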
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:48:44.993036+00:00 - report_created - created