Report #96906
[tooling] Massive slowdowns and stuttering during long generations on macOS due to memory swap
Use the --mlock flag (or set the environment variable LLAMA_MLOCK=1; the variable name can differ between builds, and recent llama.cpp builds read LLAMA_ARG_MLOCK) to ask the OS to keep the model weights in physical RAM so that macOS does not swap them to SSD. Recommended for 70B+ models on a 128GB Mac Studio.
Journey Context:
macOS compresses and swaps memory aggressively. When running a 70B Q4 model (~40GB) with a 32k context (another ~20GB), memory pressure causes the system to swap 'inactive' KV cache pages to SSD. During generation, touching those pages again causes 10-100x latency spikes, visible as stuttering. The report states that --mlock calls mlockall(MCL_CURRENT | MCL_FUTURE), pinning all allocated pages; in llama.cpp, which this flag appears to come from, the lock is instead applied with mlock() to the model's buffers, so check whether the KV cache is actually pinned in your build. Tradeoffs: the locked data must fit in physical RAM (there is no swap fallback), and startup is somewhat slower because the pages are made resident up front. Without locking, long-context chat on a Mac can become unusable under memory pressure. Note: on some Linux systems this requires elevated privileges or a raised memlock limit (ulimit -l); on macOS it works with default settings.
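Because locked memory has no swap fallback, it helps to check the budget before enabling the flag. The following is a minimal sketch, not part of any tool: the function names and the 16GB headroom reserved for the OS are assumptions for illustration, and the sizes are the approximate figures quoted above. The second helper reads the process's memlock limit, which is the limit that can block locking on some Linux systems.

```python
import resource


def fits_when_locked(weights_gb, kv_cache_gb, physical_gb, headroom_gb=16.0):
    """Return True if weights + KV cache + OS headroom fit in physical RAM.

    headroom_gb is an assumed safety margin, not a measured value.
    """
    return weights_gb + kv_cache_gb + headroom_gb <= physical_gb


def memlock_limit_allows(needed_bytes):
    """Return True if the soft RLIMIT_MEMLOCK permits locking needed_bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY or soft >= needed_bytes


if __name__ == "__main__":
    # 70B Q4 (~40GB) plus 32k context (~20GB), figures from the report.
    print(fits_when_locked(40, 20, 128))  # True: 76GB fits in 128GB
    print(fits_when_locked(40, 20, 64))   # False: 76GB exceeds 64GB
    # Result depends on the local ulimit configuration.
    print(memlock_limit_allows(40 * 10**9))
```

If the first check fails, locking is likely to make things worse rather than better, since the OS cannot page anything out to relieve pressure.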
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:14:35.704645+00:00— report_created — created