Report #22902
[tooling] Intermittent pauses and stuttering during generation on Apple Silicon Macs despite sufficient Unified Memory
Launch llama.cpp with `--no-mmap` and `--mlock` to prevent page faults during generation. This loads the entire model into physical RAM upfront and eliminates the latency spikes caused by macOS paging weights in from SSD on demand.
Journey Context:
By default, llama.cpp memory-maps (mmap) the model file, and macOS loads its pages on demand from disk. Memory-mapped weights are file-backed pages that the kernel can evict under memory pressure, even when the machine has enough Unified Memory to hold the whole model. When token generation touches a page that is not resident, the resulting page fault reads the weights back from SSD and causes a 100-500 ms pause. Using `--no-mmap` makes llama.cpp read the file into ordinary malloc-allocated memory instead, and `--mlock` (requires appropriate ulimits) pins those pages so the kernel cannot swap them out. This trades a slower initial load for deterministic, stutter-free generation. Users often blame the model or its quantization for the "slowness" when the real cause is paging latency from mmap.
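
The launch described above can be sketched as a small Python helper. This is an illustrative sketch, not part of the original report. The binary name `llama-cli`, the path `model.gguf`, and the helper names `build_command` and `mlock_limit_allows` are assumptions. It also assumes that the "appropriate ulimits" mentioned above refers to the locked-memory limit (`RLIMIT_MEMLOCK`, the value `ulimit -l` reports).

```python
import os
import resource
import shlex
from typing import List


def mlock_limit_allows(model_bytes: int) -> bool:
    """Return True if the soft locked-memory limit can cover model_bytes."""
    # Assumption: RLIMIT_MEMLOCK is the limit that governs --mlock.
    limit_id = getattr(resource, "RLIMIT_MEMLOCK", None)
    if limit_id is None:
        # The platform does not expose the limit, so we cannot check it.
        return True
    soft, _hard = resource.getrlimit(limit_id)
    return soft == resource.RLIM_INFINITY or soft >= model_bytes


def build_command(model_path: str, binary: str = "llama-cli") -> List[str]:
    """Build the argument list with mmap disabled and pages locked in RAM."""
    return [binary, "-m", model_path, "--no-mmap", "--mlock"]


if __name__ == "__main__":
    model = "model.gguf"  # hypothetical path; replace with your own file
    if os.path.exists(model) and not mlock_limit_allows(os.path.getsize(model)):
        print("warning: locked-memory limit is below the model size; --mlock may fail")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_command(model)))
```

The script only prints the command, in keeping with the warning below that workarounds should be checked before running.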
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:51:05.187823+00:00— report_created — created