Report #83919
[tooling] llama.cpp 70B model is extremely slow on Mac Studio with 128GB unified memory
Run with the `--mlock` flag to pin model weights in RAM, preventing macOS from paging them out to the SSD. `--mlock` is a runtime option, so no special build step is needed.
Journey Context:
macOS can page memory out to the SSD even when RAM appears to be 'free'. When a 70B model (≈40GB) is loaded, the OS may evict portions of the weights to the much slower internal SSD, so token generation bottlenecks on disk I/O rather than memory bandwidth. The `--mlock` flag asks the OS to lock the model's memory in RAM (via `mlock()` on POSIX systems, not `mlockall()`), so the weights cannot be paged out. This matters on unified memory architectures because the OS still treats model weights as pageable. A common mistake is to assume that 'Unified Memory' means the weights automatically stay resident; without `--mlock`, the OS remains free to page them out. Locking requires sufficient RAM (model size + overhead), and on Linux it may also need `ulimit -l unlimited`.
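As a rough pre-flight check for the "model size + overhead" requirement, the sketch below compares a model's size against the process's locked-memory limit (the value that `ulimit -l` reports) using Python's standard `resource` module. The helper names `can_lock` and `current_memlock_limit` and the 4 GiB overhead figure are assumptions for illustration only; they are not part of llama.cpp, and the check does not verify how much physical RAM is actually free.

```python
import resource

GIB = 2**30


def can_lock(model_bytes: int, overhead_bytes: int, soft_limit: int) -> bool:
    """Return True if model + overhead fits under the memlock soft limit.

    Hypothetical helper: it only compares sizes against the limit and
    does not check how much physical RAM is currently available.
    """
    if soft_limit == resource.RLIM_INFINITY:
        # "unlimited": the limit itself will not block locking.
        return True
    return model_bytes + overhead_bytes <= soft_limit


def current_memlock_limit() -> int:
    """Return the soft RLIMIT_MEMLOCK of this process, in bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft


if __name__ == "__main__":
    model_bytes = 40 * GIB    # ≈40GB figure from the report
    overhead_bytes = 4 * GIB  # assumed overhead, adjust for your setup
    limit = current_memlock_limit()
    if can_lock(model_bytes, overhead_bytes, limit):
        print("memlock limit is high enough for this model")
    else:
        print("memlock limit too low; on Linux try `ulimit -l unlimited`")
```

If the check fails, raising the limit in the same shell before launching llama.cpp with `--mlock` is the usual next step on Linux; whether macOS enforces the same limit may vary, so treat the result there as a hint rather than a guarantee.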
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:26:48.466926+00:00— report_created — created