Report #79962
[tooling] llama.cpp on Apple Silicon randomly slows down 50-100x after minutes of inference despite low CPU usage
Always run llama.cpp binaries with --mlock on macOS to keep the unified memory subsystem from swapping model weights out to SSD. Add --no-mmap only if you have enough RAM to hold the entire model; otherwise use --mlock alone.
Journey Context:
Apple Silicon uses a unified memory architecture in which the CPU, GPU, and NPU share the same physical RAM. The macOS kernel aggressively compresses memory and swaps to SSD when memory pressure occurs. Without --mlock, the weights of a 70B model (40GB+ in Q4) can be swapped out during long inference sessions, causing catastrophic performance degradation. --mlock pins the pages in RAM. However, --mlock requires the process's locked-memory limit to be raised (ulimit -l unlimited), or the process to run as root, on some systems. --no-mmap disables the file-backed mapping and forces the weights into malloc'd memory, which works better with mlock on some macOS versions but uses more RAM because the weights are copied. Common confusion: assuming swap only kicks in once DRAM is exhausted. On macOS, swap is proactive memory management.
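The flag choice described above can be sketched as a small pre-flight check. This is a minimal sketch, not part of llama.cpp: the helper names memlock_allows and suggest_flags are hypothetical, and the 25% RAM headroom used to decide on --no-mmap is an assumed threshold, not a documented one. Only the --mlock and --no-mmap flags and the RLIMIT_MEMLOCK limit (what ulimit -l reports) come from the report.

```python
import resource


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if the RLIMIT_MEMLOCK soft limit permits locking model_bytes."""
    # RLIM_INFINITY corresponds to `ulimit -l unlimited`.
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= model_bytes


def suggest_flags(model_bytes: int, ram_bytes: int, headroom: float = 0.25) -> list[str]:
    """Pick llama.cpp memory flags following the report's recommendation.

    headroom is an assumed fraction of RAM kept free for the OS and the
    KV cache; tune it for your machine.
    """
    flags = ["--mlock"]  # always pin the weights
    # Add --no-mmap only when the whole model fits comfortably in RAM.
    if model_bytes <= ram_bytes * (1 - headroom):
        flags.append("--no-mmap")
    return flags


if __name__ == "__main__":
    gib = 2**30
    # RLIMIT_MEMLOCK may be missing on some platforms, so look it up defensively.
    limit_id = getattr(resource, "RLIMIT_MEMLOCK", None)
    if limit_id is not None:
        soft, _hard = resource.getrlimit(limit_id)
        print("can lock a 40 GiB model:", memlock_allows(40 * gib, soft))
    # Illustrative sizes: a 40 GiB model on a machine with 64 GiB of RAM.
    print("suggested flags:", " ".join(suggest_flags(40 * gib, 64 * gib)))
```

If memlock_allows reports False, raising the limit with ulimit -l unlimited in the launching shell is the step the report points to; whether that succeeds depends on the system's hard limit.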
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:48:53.850218+00:00— report_created — created