Report #12775
[tooling] Running 70B models on Mac Studio with 64GB unified memory fails with OOM or excessive swapping despite sufficient total RAM
Run llama.cpp with the `--mlock` flag, set the environment variable `LLAMA_METAL_FULL_PRECISION=1`, and pass `-ngl 999` to force all layers onto the GPU (Metal). Also disable swap temporarily (`sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.dynamic_pager.plist`) so the process cannot silently degrade into paging.
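A minimal sketch of how the llama.cpp flags in this workaround might be combined into one invocation. The binary name `./llama-cli` and the model path are hypothetical placeholders, and `LLAMA_METAL_FULL_PRECISION` is taken from this report as-is rather than confirmed against llama.cpp. The script only prints the command so it can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical placeholders: adjust to your own build and model file.
LLAMA_BIN="${LLAMA_BIN:-./llama-cli}"
MODEL_PATH="${MODEL_PATH:-./models/model-70b-q4.gguf}"

# Assemble the invocation described in this report.
#   -ngl 999  request more GPU layers than the model has, i.e. offload all
#   --mlock   ask the OS to keep the model resident in RAM
#   LLAMA_METAL_FULL_PRECISION=1  as given in the report (unverified)
build_cmd() {
  printf 'LLAMA_METAL_FULL_PRECISION=1 %s -m %s -ngl 999 --mlock\n' \
    "$LLAMA_BIN" "$MODEL_PATH"
}

# Dry run: print the command instead of executing it.
build_cmd
```

Printing rather than executing keeps the sketch safe to run anywhere; the swap-disabling step is deliberately left out because it changes system state.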
Journey Context:
macOS pages unified memory aggressively: when llama.cpp allocates roughly 40GB for a 70B Q4 model, the OS often pages inactive memory out to SSD, causing latency spikes of around 10x. The `--mlock` flag locks the model in RAM (via the `mlock` syscall) so it cannot be paged out, but it may require running as root or raising the locked-memory ulimit. Setting `LLAMA_METAL_FULL_PRECISION=1` is meant to stop Metal from using fp16 for intermediate calculations, which can produce NaNs in large models. Passing `-ngl 999` keeps every layer in GPU memory instead of splitting layers with the CPU; in this setup the Metal/CPU hybrid was slower than pure CPU because of transfer overhead. This combination was the only configuration that gave usable tokens/sec on 70B models with the Mac Studio.
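The 40GB figure can be sanity-checked with rough arithmetic. The sketch below assumes about 4.5 bits per weight for a 4-bit quantization (an assumed figure that accounts for per-block scales; the exact value depends on the quant type) and ignores the KV cache and runtime buffers, so it is a lower bound rather than a measurement.

```shell
#!/bin/sh
# Approximate weight size in GB: params (billions) * bits per weight / 8.
# Bits per weight is passed multiplied by 10 to stay in integer math
# (45 means 4.5 bits, an assumed figure for a 4-bit quant).
estimate_gb() {
  echo $(( $1 * $2 / 80 ))
}

# Memory left after the weights: $1 = total RAM in GB,
# $2 = params in billions, $3 = bits per weight * 10.
headroom_gb() {
  echo $(( $1 - $(estimate_gb "$2" "$3") ))
}

echo "Weights:  ~$(estimate_gb 70 45) GB"
echo "Headroom: ~$(headroom_gb 64 70 45) GB of 64 GB"

# --mlock may fail if the locked-memory limit is smaller than the model.
echo "Locked-memory limit (ulimit -l): $(ulimit -l 2>/dev/null || echo unknown)"
```

The headroom has to cover the KV cache, macOS itself, and any other running applications, which is why a model that nominally fits can still push the system into paging.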
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:52:06.218788+00:00— report_created — created