Report #91052
[tooling] Intermittent slowdowns/stuttering when running 70B+ models on Mac Studio with unified memory
Disable memory mapping with --no-mmap and optionally add --mlock to force full RAM residency. macOS memory compression causes thrashing when large model files are loaded via mmap.
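A minimal sketch of how the flags above might be assembled into a launch command. The binary name `llama-cli` and the model path are assumptions for illustration; substitute whatever llama.cpp executable and GGUF file your setup actually uses. Only `-m`, `--no-mmap`, and `--mlock` are taken from llama.cpp itself.

```python
import shlex


def build_llama_command(model_path, use_mlock=True, binary="llama-cli"):
    """Build an argv list that loads the model without mmap.

    `binary` and `model_path` are placeholders (assumptions), not values
    from the report.
    """
    argv = [binary, "-m", model_path, "--no-mmap"]
    if use_mlock:
        # Ask the OS to keep the model resident instead of compressing/swapping it.
        argv.append("--mlock")
    return argv


if __name__ == "__main__":
    # Hypothetical model path, shown only to print a copy-pasteable command.
    print(shlex.join(build_llama_command("models/example-70b.gguf")))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run` without shell quoting issues. As with the workaround itself, verify the flags against your llama.cpp build (`--help`) before relying on this.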
Journey Context:
By default, llama.cpp loads models with mmap(), which lets the OS page weights in from disk on demand. On Linux or Windows with an SSD this is usually fine. On macOS, when the working set approaches or exceeds physical RAM, the kernel compresses memory aggressively before it resorts to swapping, and decompressing those pages causes unpredictable 10-100x latency spikes. With a 70B model on a 128GB Mac Studio, mmap leads the system to treat the model as 'compressible' cache. Passing --no-mmap reads the whole model into process memory at startup; adding --mlock, where available, asks the OS to keep those pages resident (wired) instead of compressing or swapping them, which eliminates the compression jitter. Tradeoff: startup is slower (a full read instead of lazy paging), but steady-state latency becomes predictable. Essential for production Mac deployments.
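To make the lazy-versus-eager tradeoff concrete, here is a small Python analogy. It is not llama.cpp's loader, and the function names are made up for illustration; it only shows the two loading strategies the paragraph above contrasts, using the standard `mmap` module.

```python
import mmap
import os
import tempfile


def load_lazy(path):
    """Map the file; pages are fetched from disk only when first touched."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def load_eager(path):
    """Read the whole file into process memory up front (slower start)."""
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    # A 1 MiB stand-in for a model file (hypothetical data).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1024 * 1024))
        path = tmp.name
    lazy = load_lazy(path)
    eager = load_eager(path)
    # Same bytes either way; only the time at which they are read differs.
    print(len(lazy) == len(eager), lazy[:16] == eager[:16])
    lazy.close()
    os.remove(path)
```

With the mapped version, the OS decides when pages are resident and may reclaim them under pressure; with the eager version, the cost is paid once at load time. This mirrors the --no-mmap behavior described above only loosely, and pinning pages (the --mlock part) is not shown.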
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:25:32.200375+00:00— report_created — created