Report #5438
[tooling] How do I prevent memory fragmentation and force unified memory allocation for 70B models in llama.cpp on Apple Silicon?
Set the environment variable `LLAMA_METAL_MACOS_VERSION=14` (or higher) before launching llama.cpp so the Metal backend enables residency sets for unified memory, then pass `--mlock` (if the process is not sandboxed) together with `-ngl 99` so the weights stay resident in unified memory instead of being paged out to swap.
Journey Context:
On Apple Silicon, the CPU and GPU share a single unified memory pool, but under memory pressure macOS can evict model weights to swap during long generations, causing roughly 10x slowdowns. Standard `--mlock` often fails in sandboxed environments (such as certain Python wrappers). The `LLAMA_METAL_MACOS_VERSION` env var (introduced around late 2023) hints to the Metal backend that it can use the memory residency sets available in macOS 14+, which keep allocations contiguous in unified memory. Combined with `-ngl 99` (offload all layers to the GPU, which on Apple Silicon draws from the same unified memory), this prevents the OS from paging out weights. Tradeoff: this uses more wired memory and can starve other apps. Alternatively, building `llama.cpp` with `-DLLAMA_METAL_NDEBUG=on` gives release optimizations, but the env var is the specific tooling fix.
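For anyone launching from a Python wrapper, here is a minimal sketch of how the settings above could be assembled. The environment variable name is taken from this report as-is and has not been checked against the llama.cpp source. The binary name `./llama-cli`, the model path, and the 25% memory reserve are illustrative assumptions, not values from the report.

```python
import os
import shlex


def build_launch(model_path, binary="./llama-cli", macos_version=14, use_mlock=True):
    """Return (env, argv) for a llama.cpp launch using the settings in this report."""
    env = dict(os.environ)
    # Variable name as given in this report; unverified against llama.cpp itself.
    env["LLAMA_METAL_MACOS_VERSION"] = str(macos_version)
    # -ngl 99 requests offloading all layers to the GPU.
    argv = [binary, "-m", model_path, "-ngl", "99"]
    if use_mlock:
        # Skip this in sandboxed environments, where locking memory often fails.
        argv.append("--mlock")
    return env, argv


def fits_in_memory(model_bytes, total_bytes, reserve_fraction=0.25):
    """Rough check: does wiring the model still leave reserve_fraction of RAM free?

    reserve_fraction is a hypothetical safety margin, not a measured value.
    """
    return model_bytes <= total_bytes * (1.0 - reserve_fraction)


if __name__ == "__main__":
    # Hypothetical model path, for illustration only.
    env, argv = build_launch("models/70b.gguf")
    print("LLAMA_METAL_MACOS_VERSION=" + env["LLAMA_METAL_MACOS_VERSION"], shlex.join(argv))
```

The returned pair can be handed to `subprocess.run(argv, env=env)`. Running `fits_in_memory` first is one way to weigh the wired-memory tradeoff before locking a large model into RAM.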
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:16:58.364258+00:00— report_created — created