Report #22554
[tooling] Severe performance degradation (token generation <1 t/s) after context exceeds a few thousand tokens on macOS with unified memory
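Compile llama.cpp with Metal enabled (`-DLLAMA_METAL=ON`; newer llama.cpp releases name this option `-DGGML_METAL=ON`) and run with `--mlock` (or `-mlock 1`). This asks the OS to pin the model weights in physical RAM, so macOS cannot page them out to SSD when context grows. Monitor with `vm_stat` while generating: its counters are cumulative, so confirm that the pageout count does not increase during the run.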
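The `vm_stat` check can be scripted. The following is a minimal sketch, not part of the original report. It assumes that `vm_stat` prints counters as `Name:   <number>.` lines and that the `Pageouts` and `Swapouts` counters are present. The helper names `parse_vm_stat`, `pageout_delta`, and `sample_pageouts` are hypothetical.

```python
import re
import subprocess
import time


def parse_vm_stat(text: str) -> dict[str, int]:
    """Parse `vm_stat` output into {counter name: value}.

    Assumes lines of the form `Pageouts:      789.`; the header line
    and anything else that does not fit this shape is skipped.
    """
    counters: dict[str, int] = {}
    for line in text.splitlines():
        match = re.match(r'^"?([^":]+)"?:\s+(\d+)\.?\s*$', line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def pageout_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return how much the cumulative pageout/swapout counters grew."""
    keys = ("Pageouts", "Swapouts")  # assumed counter names
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}


def sample_pageouts(seconds: float) -> dict[str, int]:
    """Snapshot vm_stat, wait while generation runs, snapshot again (macOS only)."""
    def snapshot() -> dict[str, int]:
        result = subprocess.run(
            ["vm_stat"], capture_output=True, text=True, check=True
        )
        return parse_vm_stat(result.stdout)

    before = snapshot()
    time.sleep(seconds)
    return pageout_delta(before, snapshot())
```

To use it, start a long-context generation and call `sample_pageouts(60)` from another terminal. If both deltas are zero, nothing was paged out during that window. A growing delta suggests that memory is still being swapped.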
Journey Context:
Unified memory on Apple Silicon does not stop macOS from paging: under memory pressure it still swaps pages to SSD. When running 70B models on 64GB Macs, the OS pages out model weights to make room for the growing KV cache, and generation slows catastrophically. Many users conclude that Metal is slow at long contexts; the actual cause is swap thrashing. `--mlock` forces the OS to keep the weights resident, trading a risk of OOM crashes for predictable performance. This is essential for agent workflows with long contexts on Apple Silicon. Without mlock, t/s collapses once swapping starts; with it, throughput degrades roughly linearly with context length until RAM is truly exhausted.
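To get a feel for how much memory the KV cache adds on top of the weights, a rough estimate can be computed from the model shape. This is a sketch under stated assumptions, not a measurement from the report. The default shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache entries) is an assumed Llama-2-70B-style configuration, and `kv_cache_bytes` is a hypothetical helper. Actual usage depends on the model and on the cache type that llama.cpp is configured to use.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 80,       # assumed: 70B-class model
    n_kv_heads: int = 8,      # assumed: grouped-query attention
    head_dim: int = 128,      # assumed
    bytes_per_elem: int = 2,  # assumed: 16-bit cache entries
) -> int:
    """Estimate KV cache size: one K and one V tensor per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx


if __name__ == "__main__":
    for n_ctx in (2048, 8192, 32768):
        print(f"{n_ctx:>6} tokens -> {kv_cache_bytes(n_ctx) / GIB:.3f} GiB")
```

Under these assumptions the cache costs 320 KiB per token, so an 8192-token context needs 2.5 GiB beyond the weights. On a 64GB machine already holding a quantized 70B model, that extra demand is the kind of pressure that can push weights out to swap unless they are locked.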
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:16:02.947753+00:00— report_created — created