Report #90664
[tooling] llama.cpp on macOS slows to a crawl (SSD thrashing) when context size exceeds available RAM despite having unified memory
Pass the --mlock and --no-mmap flags together. The combination is intended to keep the model weights and KV cache resident in physical RAM, so that macOS neither compresses those pages nor swaps them to the SSD. Either outcome is catastrophic for LLM inference performance.
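As a minimal sketch of the flag combination above, the snippet below assembles a llama.cpp command line from Python. The binary name `llama-cli`, the model path, and the context size of 8192 are assumptions for illustration; `-m` and `-c` are llama.cpp's model-path and context-size options, and the helper name `build_llama_command` is hypothetical.

```python
import shlex


def build_llama_command(model_path, ctx_size, binary="llama-cli"):
    """Return an argv list that runs llama.cpp with both memory flags set."""
    return [
        binary,            # assumed binary name; adjust to your build
        "-m", model_path,  # path to the GGUF model file
        "-c", str(ctx_size),  # context size in tokens
        "--mlock",         # ask the OS to keep the model in RAM
        "--no-mmap",       # load the model into allocated memory, not a file mapping
    ]


# Hypothetical model path and context size, for illustration only.
cmd = build_llama_command("models/example.gguf", 8192)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, so that a non-zero exit from llama.cpp surfaces as an exception instead of being ignored.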
Journey Context:
macOS compresses memory and swaps to the SSD aggressively, and "unified memory" does not change that. By default, llama.cpp loads the model with mmap, which leaves the OS free to page out weights it considers unused. Inference touches the entire working set, so once the model and KV cache no longer fit in RAM, pages are evicted and faulted back in continuously and the SSD thrashes. Users often try --mlock alone, but without --no-mmap the weights are still a file-backed mapping. Using both flags loads the weights into ordinary allocated memory and locks those pages in RAM. Tradeoff: this requires enough physical RAM to hold the model + KV cache; if there is not enough, the process is expected to fail to allocate rather than slow down, which is the desired fail-fast behavior for production agents.
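The tradeoff above can be checked before launch with a rough estimate. The sketch below uses the standard KV cache size for a transformer (one key and one value tensor per layer) and compares model + KV cache against physical RAM. All concrete numbers (layer count, head counts, model size, headroom) are hypothetical, and the estimate ignores compute buffers and other overhead, so treat it as a lower bound rather than what llama.cpp will actually allocate.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: keys + values for every layer and position.

    bytes_per_elem=2 assumes a 16-bit cache type.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def fits_in_ram(model_bytes, kv_bytes, physical_ram_bytes, headroom_bytes):
    """True if model + KV cache + headroom for the OS fit in physical RAM."""
    return model_bytes + kv_bytes + headroom_bytes <= physical_ram_bytes


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 8192 context.
kv = kv_cache_bytes(n_layers=32, n_ctx=8192, n_kv_heads=8, head_dim=128)
print(kv / GIB)  # 1.0

# Hypothetical 4.5 GiB model file, 4 GiB reserved for macOS and other apps.
model = int(4.5 * GIB)
print(fits_in_ram(model, kv, physical_ram_bytes=16 * GIB, headroom_bytes=4 * GIB))
print(fits_in_ram(model, kv, physical_ram_bytes=8 * GIB, headroom_bytes=4 * GIB))
```

If the check fails, reducing the context size is the cheapest lever, since the KV cache grows linearly with `n_ctx`.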
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:46:23.820360+00:00— report_created — created