Report #98836
[tooling] Running a 70B model on an Apple Silicon Mac is slow or swaps to death
Build llama.cpp with Metal (`cmake -B build -DGGML_METAL=ON`), use a Q4_K_M GGUF, and offload all layers with `-ngl 99`. On macOS, add `--mlock` only when model weights + KV cache + ~4 GB OS overhead stay under ~70% of total unified memory; otherwise leave it off, because locking too much memory makes the rest of the system thrash.
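The `--mlock` rule can be written as a quick budget check. This is a minimal sketch, not part of llama.cpp: the function name `should_mlock` and its parameters are hypothetical, and the 4 GB overhead and 70% ceiling are the rough figures from this report rather than measured limits. The 2.5 GB KV cache value in the usage loop is an assumed example.

```python
def should_mlock(weights_gb, kv_cache_gb, total_memory_gb,
                 os_overhead_gb=4.0, ceiling=0.70):
    """Return True when the locked working set stays under the ceiling.

    os_overhead_gb and ceiling are the report's rules of thumb (assumptions),
    not values read from the system.
    """
    needed_gb = weights_gb + kv_cache_gb + os_overhead_gb
    return needed_gb <= ceiling * total_memory_gb


if __name__ == "__main__":
    # ~40 GB of Q4_K_M weights, plus an assumed 2.5 GB KV cache.
    for total_gb in (64, 128):
        flag = "--mlock" if should_mlock(40.0, 2.5, total_gb) else "(no --mlock)"
        print(f"{total_gb} GB unified memory: {flag}")
```

Under these assumptions the rule leaves `--mlock` off on a 64 GB machine (46.5 GB needed against a 44.8 GB ceiling) and enables it on a 128 GB machine.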
Journey Context:
Apple Silicon shares a single high-bandwidth memory pool between CPU and GPU, so there is no VRAM wall: if the model fits in RAM, it effectively fits in GPU memory. A 70B Q4_K_M is ~40 GB and runs well on a 64 GB Mac Studio, while a 128 GB machine removes the compromise. The mistake is either using a higher quant than necessary (Q5_K_M can push you into swap) or reflexively enabling `--mlock` near the memory limit. macOS's memory compressor can silently slow decode by an order of magnitude; `--mlock` prevents that, but if you lock too much the system thrashes. Match the model size to your machine's memory and bandwidth tier, and keep memory pressure under ~70%.
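The KV cache share of the memory budget grows with context length and can be estimated from the model shape. The following is a rough sketch, assuming an fp16 cache and Llama-style 70B dimensions (80 layers, 8 KV heads, head dimension 128). Those dimensions are assumptions not stated in this report, so check the GGUF metadata for your model; the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size: one K and one V tensor per layer.

    bytes_per_elem=2 assumes an fp16 cache; a quantized cache would be smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    # Assumed Llama-style 70B shape; verify against your model's metadata.
    for n_ctx in (4096, 8192, 32768):
        size_gib = kv_cache_bytes(80, 8, 128, n_ctx) / 2**30
        print(f"n_ctx={n_ctx}: ~{size_gib:.2f} GiB KV cache")
```

With these assumed dimensions the cache is about 2.5 GiB at an 8192-token context and scales linearly with context length, so a long context can matter as much as the choice of quant when staying under the ~70% ceiling.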
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:52:05.384563+00:00— report_created — created