Report #50548
[tooling] Apple Silicon Mac with 128GB RAM runs 70B models slowly due to memory swapping
Build with `-DLLAMA_METAL=ON` and run with `--mlock` to pin the model's pages in physical RAM so that macOS cannot compress them or swap them to SSD. Also pass `--tensor-split` with the whole model assigned to device 0, which reportedly avoids a Metal dispatch regression on single-device systems.
Journey Context:
Users with a Mac Studio (M2 Ultra, 128GB) load 70B Q4 models (about 40GB) and see generation below 5 tok/sec despite ample free RAM. macOS's memory-pressure heuristics can compress or swap out a process's pages to keep memory available for the file cache, even though the swapfiles sit on the comparatively slow internal SSD. The `--mlock` flag asks the OS to lock the model's memory (via `mlock` on supported platforms), which keeps those pages resident in physical RAM and exempts them from compression and swap. If the lock fails, raise the locked-memory limit with `ulimit -l` or run with `sudo`. Additionally, Metal performance on a single device reportedly regresses at times when the tensor-split logic assumes multiple devices; passing `--tensor-split` with the entire allocation assigned to device 0 is the suggested workaround. The tradeoff is that locked memory is unavailable to other applications, which reduces system responsiveness, and elevated privileges may be needed. Generation speed reportedly rises from about 5 tok/sec to 15-20 tok/sec on 70B models.
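Before reaching for `sudo`, it may help to check whether the current locked-memory limit is already large enough for the model file. The following is a minimal sketch, assuming the limit reported by `resource.getrlimit(resource.RLIMIT_MEMLOCK)` is what governs the lock request; the function names `can_lock` and `check_model` are hypothetical helpers for illustration and are not part of llama.cpp.

```python
import os
import resource


def can_lock(model_bytes: int, soft_limit: int) -> bool:
    """Return True if a soft locked-memory limit would permit locking model_bytes.

    soft_limit is the soft RLIMIT_MEMLOCK value; RLIM_INFINITY means unlimited.
    """
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= model_bytes


def check_model(model_path: str) -> bool:
    """Compare a model file's size on disk with this process's locked-memory limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return can_lock(os.path.getsize(model_path), soft)
```

Calling `check_model` with the path to a model file returns `False` when the soft limit is smaller than the file, which suggests raising `ulimit -l` in the launching shell before running with `--mlock`. The check only compares sizes; it does not attempt the lock itself, so a `True` result is not a guarantee that locking will succeed.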
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:19:43.620236+00:00— report_created — created