Report #87384
[tooling] llama.cpp slow on Apple Silicon despite plenty of free RAM, causing SSD wear
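Compile with `-DLLAMA_METAL=1` (newer llama.cpp trees name this option `GGML_METAL` and enable Metal by default on macOS) and run with both `--mlock` and `--no-mmap`. Together these load the model into RAM and pin it there, so macOS does not page tensor data out to SSD. The goal is for the weights to stay resident in unified memory that the GPU can read without an extra copy.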
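A minimal sketch of the build and run steps described above, written as a dry run that only prints the commands unless `RUN=1` is set. The model path, the `build` directory, and the `llama-cli` binary location are assumptions for illustration; only `-DLLAMA_METAL=1`, `--mlock`, and `--no-mmap` come from this report.

```shell
#!/bin/sh
# Hypothetical paths (assumptions, not from the report); override via environment.
# Paths containing spaces are not handled by this sketch.
MODEL="${MODEL:-models/model.gguf}"   # placeholder model file
BIN="${BIN:-./build/bin/llama-cli}"   # binary name and location vary by llama.cpp version

# Build flag as given in the report. Newer trees use -DGGML_METAL=ON instead.
BUILD_CMD="cmake -B build -DLLAMA_METAL=1 && cmake --build build --config Release"

# --no-mmap: read the model into allocated memory instead of mapping the file.
# --mlock:   ask the OS to keep those pages in physical RAM.
RUN_CMD="$BIN -m $MODEL --mlock --no-mmap"

if [ "${RUN:-0}" = "1" ]; then
  # Actually build and run.
  sh -c "$BUILD_CMD" && sh -c "$RUN_CMD"
else
  # Dry run: print the commands so they can be reviewed first.
  printf '%s\n' "$BUILD_CMD" "$RUN_CMD"
fi
```

Printing first and running only on request fits the caution that applies to unverified workarounds: review the exact command line before executing it.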
Journey Context:
macOS treats unified memory as swap-backed by default. With memory mapping enabled (the llama.cpp default, which `--no-mmap` turns off), the OS can page out "inactive" tensor data to SSD even on machines with 64GB+ of RAM, causing reported 100x latency spikes and SSD wear. `--no-mmap` makes llama.cpp read the model into allocated memory instead of mapping the file, and `--mlock` pins those pages in physical RAM so they stay resident and GPU-accessible on Apple Silicon. This matters most for 70B+ models on 128GB Macs.
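Because `--no-mmap` with `--mlock` keeps the whole model in physical RAM, it can help to confirm the model fits before running. The sketch below is one way to check that, not part of llama.cpp: the `fits_in_ram` helper and its 75% headroom default are hypothetical, illustrative choices. The macOS-only part uses `sysctl -n hw.memsize` for installed RAM and `sysctl vm.swapusage` to show current swap use.

```shell
#!/bin/sh
# fits_in_ram MODEL_BYTES RAM_BYTES [HEADROOM_PERCENT]
# Succeeds if the model occupies at most HEADROOM_PERCENT of RAM.
# The 75% default is an arbitrary illustrative threshold (assumption).
fits_in_ram() {
  model_bytes=$1
  ram_bytes=$2
  pct=${3:-75}
  [ "$model_bytes" -le $(( ram_bytes * pct / 100 )) ]
}

# macOS only, and only if MODEL points at an existing file.
if [ "$(uname)" = "Darwin" ] && [ -n "${MODEL:-}" ] && [ -f "$MODEL" ]; then
  ram=$(sysctl -n hw.memsize)               # installed RAM in bytes
  size=$(wc -c < "$MODEL" | tr -d ' ')      # model file size in bytes
  if fits_in_ram "$size" "$ram"; then
    echo "model fits within the headroom threshold"
  else
    echo "model may not fit; pinning it could starve other processes"
  fi
  sysctl vm.swapusage                       # swap in use before the run
fi
```

Running `sysctl vm.swapusage` again during inference gives a rough indication of whether the system is still swapping with the flags applied.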
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:15:54.511602+00:00— report_created — created