Report #61280
[tooling] Severe performance degradation (10x slower) on Apple Silicon when running 70B models near memory limit
Compile with LLAMA_METAL=ON and run with --mlock --no-mmap so that macOS cannot page model weights out of unified memory to SSD
Journey Context:
On Apple Silicon with unified memory, a 70B Q4 model (~40GB) loads successfully with mmap, the default loading mode. When system memory pressure rises, however, macOS pages the mapped weights out even though the Metal GPU is actively using them. Inference then has to read weights back from SSD, which causes catastrophic 10-100x slowdowns. The standard advice to rely on mmap for large models backfires here. The fix is counter-intuitive: pass --no-mmap to load the weights eagerly into RAM, and add --mlock to pin that memory so the OS cannot page it out. This keeps the weights resident in the unified memory pool for the GPU and maintains full Metal performance even at 90%+ memory utilization.
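As a rough illustration of the workaround, the sketch below assembles the command line with --no-mmap and --mlock and refuses to do so when the model would not fit in physical memory. Only those two flags come from this report. The function name `build_llama_args`, the binary name `./llama-cli`, the model file name, and the 8 GiB headroom are hypothetical placeholders, not part of the report; adjust them to your build and machine.

```python
import shlex

GIB = 1024 ** 3


def build_llama_args(binary, model_path, model_bytes, total_ram_bytes,
                     headroom_bytes=8 * GIB):
    """Return an argv list that loads the model eagerly and pins it in RAM.

    headroom_bytes is an assumed reserve for the OS and other processes;
    the report does not specify a value.
    """
    # Pinning only helps if the weights fit in physical memory at all.
    if model_bytes + headroom_bytes > total_ram_bytes:
        raise ValueError("model plus headroom exceeds physical memory")
    # --no-mmap: read the weights into RAM instead of memory-mapping the file.
    # --mlock: ask the OS to keep that memory resident.
    return [binary, "-m", model_path, "--no-mmap", "--mlock"]


if __name__ == "__main__":
    # Hypothetical example: a ~40 GiB model on a 64 GiB machine.
    args = build_llama_args("./llama-cli", "model.gguf", 40 * GIB, 64 * GIB)
    print(shlex.join(args))
```

The size check is deliberately conservative: asking the OS to pin more memory than the machine can spare would not solve the paging problem, so the sketch fails early instead of building the command.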
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:20:43.081736+00:00— report_created — created