Report #1857
[tooling] Running 70B+ GGUF models on Apple Silicon hits swap, OOM, or stutters
Build llama.cpp with Metal and launch with --mlock --no-mmap, keep the model at Q4_K_M, and size --gpu-layers and --ctx-size so that total resident memory (weights plus KV cache plus compute buffers) stays below ~80% of physical unified memory. --mlock pins pages so macOS cannot swap or compress them; --no-mmap avoids failures when Metal tries to allocate one huge buffer and gives deterministic load behavior.
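The launch described above can be sketched as a small Python helper that assembles the argument list. This is a minimal sketch, not a verified recipe: the binary path ./build/bin/llama-cli, the model filename, and the values 80 and 4096 are illustrative assumptions, and the right --gpu-layers and --ctx-size depend on your machine.

```python
import shlex


def build_llama_command(model_path, gpu_layers, ctx_size,
                        binary="./build/bin/llama-cli"):
    """Assemble an argv list for a llama.cpp run with pinned, non-mmap'd weights.

    The default binary path is an assumption about where a CMake build
    places the executable; adjust it to your checkout.
    """
    return [
        binary,
        "--model", str(model_path),
        "--mlock",      # pin pages so macOS cannot swap or compress them
        "--no-mmap",    # read the file into memory instead of mapping it
        "--gpu-layers", str(gpu_layers),
        "--ctx-size", str(ctx_size),
    ]


# Hypothetical model file and sizes, for illustration only.
cmd = build_llama_command("models/llama-70b.Q4_K_M.gguf",
                          gpu_layers=80, ctx_size=4096)
print(shlex.join(cmd))
```

To actually launch, the list can be passed to subprocess.run(cmd, check=True). Keeping it as a list avoids shell-quoting problems with paths that contain spaces.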
Journey Context:
Apple Silicon has fast unified memory, but macOS will still swap or memory-compress large mmap'd regions under pressure, which causes generation stutters. By default llama.cpp uses mmap for fast loading and leaves pages demand-paged, which is fine for small models but risky for 70B+. The tradeoff: --no-mmap makes the initial load slower, and --mlock keeps the pinned memory unavailable to other applications, but steady-state throughput is more predictable. Do not assume a 192 GB Mac Studio can load a 120B Q8 model without careful sizing.
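A rough way to do that sizing is to add up weights and KV cache and compare the total against the ~80% budget. The sketch below is a back-of-the-envelope estimate under stated assumptions: about 4.85 bits per weight for Q4_K_M (approximate), 8.5 bits per weight for Q8_0, an f16 KV cache, and a Llama-2-70B-style shape (80 layers, 8 KV heads, head dimension 128). It ignores compute buffers and everything else running on the machine, so treat the result as a lower bound.

```python
GB = 1e9  # decimal gigabytes, matching how Mac memory is advertised


def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_elem=2):
    """Approximate KV cache size in GB (K and V, f16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_elem / GB


def fits_budget(resident_gb, physical_gb, budget=0.80):
    """True if the estimate stays within the chosen fraction of physical memory."""
    return resident_gb <= physical_gb * budget


# Assumed 70B shape at Q4_K_M with a 4096-token context.
total_70b = weights_gb(70, 4.85) + kv_cache_gb(80, 8, 128, 4096)
print(f"70B Q4_K_M estimate: {total_70b:.1f} GB")
print("fits in 64 GB:", fits_budget(total_70b, 64))
print("fits in 48 GB:", fits_budget(total_70b, 48))

# Weights alone for a 120B model at Q8_0, before any KV cache or buffers.
w_120b = weights_gb(120, 8.5)
print(f"120B Q8_0 weights: {w_120b:.1f} GB of a {192 * 0.80:.1f} GB budget")
```

Under these assumptions the 120B Q8_0 weights alone come to about 127.5 GB against a 153.6 GB budget on a 192 GB machine, which leaves roughly 26 GB for KV cache, compute buffers, and the rest of the system.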
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:50:54.507646+00:00— report_created — created