Report #11624
[tooling] Stuttering and unpredictable generation latency when running 70B models on Apple Silicon with unified memory
Compile llama.cpp with -DLLAMA_METAL=ON (named -DGGML_METAL=ON in newer releases) and run with --mlock so that macOS keeps the model weights resident in RAM instead of paging them out. This keeps token latency consistent on unified-memory machines.
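As a rough illustration of the run step, the sketch below assembles a llama.cpp command line with --mlock enabled. The binary name `./llama-cli` (older builds call it `main`), the model path, and the helper `build_llama_command` are assumptions for illustration, not part of the original report; adjust them to your build.

```python
import shlex


def build_llama_command(model_path, binary="./llama-cli", gpu_layers=99, prompt="Hello"):
    """Assemble an argv list for a llama.cpp run with the model locked in RAM.

    `binary`, `gpu_layers`, and `prompt` are illustrative defaults (assumptions).
    """
    return [
        binary,
        "-m", model_path,         # path to the GGUF model file
        "-ngl", str(gpu_layers),  # number of layers to offload to the GPU (Metal)
        "--mlock",                # ask llama.cpp to keep the model pinned in RAM
        "-p", prompt,             # prompt text
    ]


# Hypothetical model path; replace with your own file.
argv = build_llama_command("models/llama-70b-q4.gguf")
print(shlex.join(argv))
```

Building the command as a list rather than a string keeps it safe to pass directly to `subprocess.run(argv)` without shell quoting issues.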
Journey Context:
Apple Silicon uses unified memory: the CPU and GPU share the same physical RAM. A 70B model needs roughly 40 GB or more of RAM (at roughly 4-bit quantization), and under memory pressure macOS may page out inactive memory even when 'free' memory appears available. Because llama.cpp memory-maps the model file (mmap) by default, the kernel is free to evict weight pages from RAM. The next access to an evicted page blocks on an SSD read, which shows up as generation freezing (stuttering). Passing --mlock makes llama.cpp call mlock() on the model's memory so that those pages stay pinned in RAM. Tradeoff: pinned pages cannot be paged out, so if RAM is exhausted the system may kill the process (OOM) rather than swap; leave ~20% headroom. The alternative, --no-mmap, loads the model fully into RAM but still allows it to be swapped; --mlock is stricter. This matters most for interactive use cases such as chatbots, where consistent latency matters more than throughput.
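The headroom rule above can be checked with simple arithmetic before enabling --mlock. The sketch below assumes "~20% headroom" means leaving 20% of total RAM unpinned; that reading, the helper name `mlock_fits`, and the example RAM sizes are assumptions. It ignores KV cache and other runtime allocations, so treat a True result as a lower bound.

```python
def mlock_fits(model_gb, total_ram_gb, headroom=0.20):
    """Return True if a model of `model_gb` fits in RAM while leaving
    `headroom` (a fraction of total RAM) free for the rest of the system."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_gb = total_ram_gb * (1 - headroom)  # RAM we allow the model to pin
    return model_gb <= usable_gb


# A ~40 GB model on a few example unified-memory sizes (illustrative values).
for ram_gb in (48, 64, 96):
    verdict = "ok to mlock" if mlock_fits(40, ram_gb) else "too tight for mlock"
    print(f"{ram_gb} GB RAM: {verdict}")
```

Under this reading, a 40 GB model does not leave 20% headroom on a 48 GB machine (only 38.4 GB is usable) but does on 64 GB and above.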
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:47:59.959553+00:00— report_created — created