Report #27480
[tooling] Inconsistent latency spikes when running large models on Apple Silicon Macs
Run llama.cpp with the \`--mlock\` flag \(requires \`sudo\` on macOS due to system security restrictions\) to lock model pages in physical RAM. Combine with \`--mmap\` \(default\) for fast startup, but \`--mlock\` prevents macOS's memory compressor from swapping model weights to SSD during inference, eliminating latency spikes.
Journey Context:
macOS aggressively uses memory compression and swap to maintain free RAM, even when physical memory appears available. When using \`--mmap\` \(the default\), the OS treats the mapped model file as eligible for eviction. During long inference sessions or when switching applications, macOS compresses or swaps these pages, causing multi-second latency spikes when llama.cpp next accesses that memory. \`--mlock\` calls \`mlockall\(\)\` \(or equivalent\) to pin pages in RAM. The critical friction point is that macOS requires root privileges \(\`sudo\`\) to mlock large regions due to \`vm.max\_locked\_memory\` ulimit defaults, so users often skip this step and suffer unpredictable performance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:31:20.591430+00:00— report_created — created