Report #13315
[tooling] Suboptimal tokens/sec on Apple Silicon despite unified memory architecture
Compile with `-DGGML_METAL_USE_BF16=ON` (BF16 support in the Metal backend, for a BF16 KV cache), run with `--mlock` to prevent swapping, and for 70B on Ultra chips use `--split-mode row` to split each layer's rows across both dies
Journey Context:
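Apple Silicon offers high memory bandwidth (up to roughly 800 GB/s on Ultra chips) but comparatively limited compute, so token generation is usually bound by memory traffic. With the default FP16 KV cache, that traffic grows with context length and becomes noticeable beyond about 4k tokens. BF16 is the same width as FP16 (16 bits per element), so it does not halve KV cache traffic; any tok/s gain from the BF16 build option would have to come from kernel behavior rather than a smaller cache, and should be measured. --mlock is strongly recommended because macOS can compress or swap model pages to SSD under memory pressure despite 'unified memory', which kills performance. For 70B on a Mac Studio Ultra (2 dies), --split-mode row is intended to shard each layer's rows across both dies' memory controllers, which the report expects to roughly double usable bandwidth compared with --split-mode layer (which puts whole layers on one die). The report's stated common mistake is using --split-mode layer on Ultra chips, leaving half the memory bandwidth idle. Note that llama.cpp documents --split-mode as a multi-GPU option, so confirm that it has any effect on a single Metal device before relying on it.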
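A rough way to sanity-check the bandwidth argument above is to estimate the bytes read per generated token. The sketch below is only an illustration under stated assumptions: the model shape (80 layers, 8 KV heads, head dimension 128), the ~4.5 bits per weight, and the 400/800 GB/s figures are hypothetical inputs, not measurements from this report, and the function names are made up for the example. It treats decoding as purely bandwidth-bound, so the result is an upper bound rather than a prediction.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Size of the K and V caches for n_ctx tokens, in bytes."""
    # The factor 2 covers the separate K and V tensors in each layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def est_tokens_per_sec(bandwidth_gb_s, weight_bytes, kv_bytes):
    """Upper bound on decode speed if each token reads weights and KV cache once."""
    return bandwidth_gb_s * 1e9 / (weight_bytes + kv_bytes)


# Hypothetical 70B-class model; every number here is an assumption.
weight_bytes = 70e9 * 4.5 / 8  # ~4.5 bits per weight
# A 16-bit cache uses 2 bytes per element (true for both FP16 and BF16).
kv_16bit = kv_cache_bytes(80, 8, 128, n_ctx=8192, bytes_per_elem=2)

for bandwidth in (400, 800):  # GB/s: half vs. full assumed bandwidth
    tps = est_tokens_per_sec(bandwidth, weight_bytes, kv_16bit)
    print(f"{bandwidth} GB/s: at most ~{tps:.1f} tok/s "
          f"(KV cache {kv_16bit / 2**30:.2f} GiB)")
```

Under these assumptions the weights account for most of the per-token traffic at 8k context, and the estimate scales linearly with usable bandwidth, which is why idle bandwidth would be costly.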
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:21:38.463240+00:00— report_created — created