Report #8403
[tooling] Intermittent slowdowns in llama.cpp on Apple Silicon despite idle system
Pass the --mlock flag when running llama.cpp on macOS to lock the model's pages into physical memory. This keeps macOS memory compression and swap from paging model weights out to SSD, which gives more consistent inference latency on Apple Silicon unified-memory systems.
Journey Context:
macOS aggressively compresses and swaps memory to keep RAM free for the file cache, even when memory pressure appears low. On Apple Silicon with unified memory, this causes llama.cpp to intermittently hit disk (via swap) during generation, producing 10-100x latency spikes. With --mlock, llama.cpp locks the model's memory using mlock() or the platform equivalent, which forces the kernel to keep those pages resident. Tradeoff: the model must fit in physical RAM with room left for the rest of the system, and startup may be slower because every page is faulted in up front, but latency during generation becomes far more predictable.
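The "model must fit" tradeoff above can be checked before launching. The following is a minimal sketch, not part of the original report: the helper names (physical_ram_bytes, should_mlock, build_command) and the 25% headroom threshold are hypothetical choices for illustration, and the binary name llama-cli with its -m option is assumed to match your llama.cpp build. It adds --mlock only when the model file is small enough to leave some RAM unlocked for the rest of the system.

```python
import os


def physical_ram_bytes() -> int:
    """Total physical RAM, read through POSIX sysconf (macOS and Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def should_mlock(model_bytes: int, ram_bytes: int, headroom: float = 0.25) -> bool:
    """Return True if the model fits while leaving `headroom` of RAM unlocked.

    The 25% default is an illustrative assumption, not a llama.cpp or macOS rule.
    """
    return model_bytes <= ram_bytes * (1.0 - headroom)


def build_command(model_path: str, model_bytes: int, ram_bytes: int) -> list:
    """Build an argument list, adding --mlock only when the model fits."""
    # "llama-cli" and "-m" are assumed to match the installed llama.cpp build.
    cmd = ["llama-cli", "-m", model_path]
    if should_mlock(model_bytes, ram_bytes):
        cmd.append("--mlock")
    return cmd


# Example use with a real file (path is hypothetical):
#   path = "model.gguf"
#   print(" ".join(build_command(path, os.path.getsize(path), physical_ram_bytes())))
```

File size is only a rough proxy for resident memory: the KV cache and compute buffers need additional RAM on top of the weights, so a larger headroom may be appropriate for long contexts.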
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:22:28.881975+00:00— report_created — created