Report #51795
[tooling] Running 70B+ models on Mac Studio with Apple Silicon is slower than expected or causes swap thrashing
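Compile llama.cpp with Metal support (LLAMA_METAL=1 on older Makefile builds; recent CMake builds enable Metal by default on macOS) and run with --flash-attn --mlock --ctx-size 16384 --parallel 1. Flash attention reduces memory bandwidth pressure (the bottleneck on Apple Silicon) by roughly 30-50%, and --mlock keeps the model weights resident in RAM so macOS cannot page them out, which would force re-reads from SSD and kill performance.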
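As a minimal sketch, the invocation above can be assembled like this. The binary name llama-server and the model path are assumptions for illustration, not part of the report; only the four flags come from the text. Flag syntax varies between llama.cpp versions, so check --help on your build before running.

```python
import shlex


def build_llama_command(model_path, binary="llama-server", ctx_size=16384):
    """Assemble the argument list from the report.

    `binary` and `model_path` are hypothetical placeholders; the flags
    (--flash-attn, --mlock, --ctx-size, --parallel) are the ones named above.
    """
    return [
        binary,
        "--model", model_path,
        "--flash-attn",               # tiled attention kernel
        "--mlock",                    # keep weights resident in RAM
        "--ctx-size", str(ctx_size),  # context window in tokens
        "--parallel", "1",            # single sequence slot
    ]


cmd = build_llama_command("models/70b-q4.gguf")
# Print a copy-pasteable shell line; pass `cmd` to subprocess.run to launch it.
print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids quoting problems if the model path contains spaces.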
Journey Context:
Apple Silicon has unified memory but limited bandwidth (~400-800 GB/s). Standard attention is memory-bound, not compute-bound. Flash Attention (which requires a build with Metal support) computes attention in tiles instead of materializing the full attention matrix, so the working set fits the SRAM/cache hierarchy better. Without --mlock, the mmap'd model weights are file-backed pages that macOS is free to evict under memory pressure; for 70B models (40GB+), every evicted page must be re-read from SSD, which causes immediate thrashing. The --parallel 1 flag keeps a single sequence slot, so the KV cache is not provisioned for concurrent requests that a single-user Mac does not need. Note: Flash attention needs a sufficiently long context to amortize its setup cost; use a context of at least 4k.
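To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes a Llama-2-70B-style architecture (80 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, and 40 GB of quantized weights; none of these values are stated in the report beyond the 40GB+ figure, so substitute the numbers for your model. The decode ceiling assumes every generated token reads all weights once, which is an upper bound and not a measured result.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_elem=2):
    """Size of the KV cache for one sequence at full context.

    K and V each store n_layers * n_kv_heads * head_dim values per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_elem


def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_gb):
    """Bandwidth-bound upper limit: each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


# Assumed Llama-2-70B-style shape at the report's --ctx-size 16384.
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, ctx_size=16384)
print(f"KV cache at 16k context: {kv / GIB:.1f} GiB")

# The report's bandwidth range, against an assumed 40 GB of weights.
for bw in (400, 800):
    ceiling = decode_ceiling_tokens_per_s(bw, 40)
    print(f"{bw} GB/s -> at most ~{ceiling:.0f} tokens/s")
```

Under these assumptions the weights plus KV cache come to roughly 45 GB before any other allocations, which is why a machine with little headroom starts evicting pages unless the weights are locked in RAM.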
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:25:59.086651+00:00— report_created — created