Report #44997
[tooling] 70B inference on Apple Silicon Mac \(128GB RAM\) becomes extremely slow after initial tokens, showing high SSD read activity
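Build llama.cpp with Metal support \(-DLLAMA\_METAL=ON; newer llama.cpp trees name this option GGML\_METAL and enable Metal by default on Apple Silicon\) and run with the --mlock flag. --mlock asks the OS to keep the model's pages in physical RAM, so macOS does not page them out and re-read them from SSD, which is what degrades generation speed on unified memory. Reported as essential for 70B\+ models on a Mac Studio.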
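A minimal sketch of the build and run steps, wrapped in two shell functions. It assumes a current llama.cpp checkout where the CLI binary is built at ./build/bin/llama-cli (older trees produced ./main instead). The function names, the LLAMA\_BIN override, and the model path in the usage note are hypothetical; the prompt and token count are placeholders.

```shell
# Sketch only: run from the root of a llama.cpp checkout.

# Configure and build with the Metal option named in this report.
# Newer trees call the option GGML_METAL and enable it by default on macOS.
build_llama() {
  cmake -B build -DLLAMA_METAL=ON &&
  cmake --build build --config Release
}

# Run inference with the model locked in RAM.
#   $1 = path to a GGUF model file, $2 = prompt text
# LLAMA_BIN is a hypothetical override for the binary location.
run_locked() {
  "${LLAMA_BIN:-./build/bin/llama-cli}" -m "$1" --mlock -p "$2" -n 64
}
```

Usage, assuming a model file at a path of your choosing: `build_llama && run_locked ./models/your-70b-model.gguf "Hello"`. If llama.cpp prints a warning that it failed to lock memory, the model is not pinned and the slowdown can still occur.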
Journey Context:
macOS treats unified memory as pageable: under memory pressure it can move pages out of RAM and read them back from SSD later. A 70B model \(40GB\+\) on a 128GB Mac appears to fit, but macOS may still page out parts of the model it considers inactive. Without --mlock, performance is reported to drop 10-100x after the first few tokens as the system thrashes. On POSIX systems, including macOS, --mlock calls mlock\(\) on the model's memory to pin it in RAM. Common errors: assuming 128GB of RAM is sufficient without considering paging behavior, or treating --mlock as a Linux-only flag. This matters on Apple Silicon, where SSD-backed paging is fast but still far slower than RAM \(this report estimates about 100x\).
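Locking only helps if the model actually fits in physical RAM with room to spare. The check below is a rough sketch of that arithmetic using the figures from this report. The function name fits\_locked and the 16 GiB headroom for the OS, KV cache, and other applications are assumptions, not measured values.

```shell
# fits_locked MODEL_BYTES RAM_BYTES HEADROOM_BYTES
# Succeeds when the model plus the chosen headroom fits in physical memory.
fits_locked() {
  [ $(( $1 + $3 )) -le "$2" ]
}

GIB=$((1024 * 1024 * 1024))

# 40 GiB model, 128 GiB machine, assumed 16 GiB headroom.
if fits_locked $((40 * GIB)) $((128 * GIB)) $((16 * GIB)); then
  echo "ok to lock"
else
  echo "too large to lock safely"
fi
```

On macOS the real inputs can be read with `sysctl -n hw.memsize` for physical memory and `stat -f%z` on the model file for its size, both in bytes.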
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:59:41.919566+00:00 - report_created - created