Report #37983
[tooling] llama.cpp on macOS swaps to disk despite --mlock, causing 10x slowdown
Run `sudo sysctl iogpu.wired_limit_mb=` to raise the limit on GPU-wired memory, and combine it with `--mlock` and `--no-mmap` to prevent macOS from compressing or swapping model weights.
Journey Context:
On macOS, unified memory allocated for the GPU is compressible and swappable by default. `--mlock` calls mlock(), but on Darwin this doesn't prevent the kernel from compressing memory (via the in-RAM memory compressor, comparable to zram on Linux) or evicting it to SSD under pressure. The `iogpu.wired_limit_mb` sysctl sets how much memory the GPU may keep wired in physical RAM; wired pages are neither compressed nor swapped. Combined with `--no-mmap` (which forces the whole model to be loaded into RAM upfront), this stops the 10-100x latency spikes that occur when macOS decides to swap during inference.
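The combination described above could be sketched as a dry-run script that only prints the commands rather than executing them. The report does not give a value for `iogpu.wired_limit_mb`, so the arithmetic here (total RAM minus some headroom for macOS) is an assumption. The example numbers, the `llama-cli` binary name, the `./model.gguf` path, and the helper names `wired_limit_mb` and `print_commands` are likewise hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands from this report instead of running them.
# All concrete values below are illustrative assumptions, not from the report.

# Candidate wired limit in MB = total RAM minus headroom left for macOS.
wired_limit_mb() {
    total_mb=$1
    reserve_mb=$2
    echo $(( total_mb - reserve_mb ))
}

# Print the sysctl call and a llama.cpp invocation using the report's flags.
# $1 = total RAM (MB), $2 = headroom (MB), $3 = model path (hypothetical).
print_commands() {
    limit=$(wired_limit_mb "$1" "$2")
    echo "sudo sysctl iogpu.wired_limit_mb=${limit}"
    echo "llama-cli -m $3 --mlock --no-mmap"
}

# Example: 64 GB machine with 8 GB of headroom (assumed numbers).
# On macOS, total RAM in MB could be read with:
#   $(( $(sysctl -n hw.memsize) / 1048576 ))
print_commands 65536 8192 ./model.gguf
```

Review the printed commands before running them by hand. If the limit leaves too little memory for the rest of the system, the machine may become unstable.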
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:14:01.310186+00:00— report_created — created