Report #64658
[tooling] Intermittent extreme latency spikes (seconds) during inference on Apple Silicon with large models that fit in unified memory
Run llama.cpp server/main with the --mlock flag (requires running as root or raising ulimit -l) to prevent macOS from paging model weights out to SSD swap.
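A minimal pre-flight sketch for the workaround above, assuming the server binary is named llama-server and is on PATH (older builds call it server). The helper names memlock_allows, build_command and launch are hypothetical, and the GGUF file size is only a rough stand-in for the amount of memory that will actually be locked.

```python
import os
import resource
import subprocess


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if a lock of model_bytes fits under the RLIMIT_MEMLOCK soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return model_bytes <= soft_limit


def build_command(model_path: str) -> list:
    """Command line for the server with memory locking enabled."""
    # Assumption: binary name "llama-server"; -m and --mlock are llama.cpp flags.
    return ["llama-server", "-m", model_path, "--mlock"]


def launch(model_path: str) -> subprocess.Popen:
    """Check the locked-memory limit (what `ulimit -l` reports), then start the server."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    size = os.path.getsize(model_path)  # rough proxy for the locked region
    if not memlock_allows(size, soft):
        raise RuntimeError(
            f"RLIMIT_MEMLOCK soft limit ({soft} bytes) is below model size "
            f"({size} bytes); raise ulimit -l or run as root"
        )
    return subprocess.Popen(build_command(model_path))
```

Calling launch("/path/to/model.gguf") (a placeholder path) fails early with a readable error instead of starting the server under a limit that is too low for the lock.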
Journey Context:
macOS aggressively swaps anonymous memory to SSD even when unified memory appears to be available, especially during long-running inference. When part of a 70B model's working set has been swapped out, token generation stalls for 100-1000 ms at a time while the pages are read back from SSD, which is how the spikes can reach seconds when several stalls occur back to back. --mlock pins the model in RAM, so every weight read is served from memory rather than from SSD. The tradeoff is slightly slower process startup and, if you over-allocate, a failed lock or an out-of-memory crash instead of graceful swapping.
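The pinning described above comes down to the mlock(2) system call. The following is an illustrative sketch of that mechanism on a small anonymous buffer, not llama.cpp's own code; try_mlock is a hypothetical helper, and it assumes a Unix-like system where the C library symbols are reachable through ctypes.CDLL(None).

```python
import ctypes
import mmap
import os


def try_mlock(num_bytes: int) -> bool:
    """Allocate an anonymous buffer and try to pin it in RAM with mlock(2)."""
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process, incl. libc
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.restype = ctypes.c_int

    buf = mmap.mmap(-1, num_bytes)          # anonymous, page-aligned mapping
    view = ctypes.c_char.from_buffer(buf)   # gives access to the buffer's address
    try:
        addr = ctypes.addressof(view)
        if libc.mlock(addr, num_bytes) != 0:
            # Typical failures: the request exceeds the locked-memory limit.
            print("mlock failed:", os.strerror(ctypes.get_errno()))
            return False
        libc.munlock(addr, num_bytes)       # unpin again; this is only a probe
        return True
    finally:
        del view                            # release the buffer export before closing
        buf.close()
```

Probing with try_mlock(mmap.PAGESIZE) shows whether the current process may lock memory at all; a lock the size of a 70B model can still fail if it exceeds the configured limit or the available RAM.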
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:00:53.361774+00:00 - report_created - created