Report #11832
[tooling] Optimizing 70B inference on Apple Silicon with unified memory (Mac Studio/MacBook)
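Use -ngl 99 with --split-mode row (llama.cpp server/main) when running 70B models on Macs. -ngl 99 offloads every layer to the GPU, and --split-mode row requests row-level tensor splitting instead of the default layer-level splitting. The reported benefit is reduced memory bandwidth contention on unified memory architectures. llama.cpp documents --split-mode as a multi-GPU option, so benchmark it against the default on a single-GPU Mac before relying on it.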
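A minimal sketch of how the suggested invocation could be assembled, for anyone scripting their launches. The binary name llama-server and the model path are assumptions for illustration (older builds ship the binaries as server and main); -m, -ngl and --split-mode are the llama.cpp flags named in this report.

```python
import shlex

# Split modes accepted by llama.cpp's --split-mode flag.
VALID_SPLIT_MODES = {"none", "layer", "row"}


def build_llama_command(model_path, n_gpu_layers=99, split_mode="row",
                        binary="llama-server"):
    """Return the argv list for the invocation suggested in this report.

    `binary` and `model_path` are hypothetical placeholders; adjust them
    to match the local llama.cpp build and model file.
    """
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"unknown split mode: {split_mode!r}")
    return [
        binary,
        "-m", model_path,
        "-ngl", str(n_gpu_layers),   # 99 exceeds the layer count, so all layers are offloaded
        "--split-mode", split_mode,
    ]


if __name__ == "__main__":
    # Hypothetical model path, shown only to illustrate the shape of the command.
    cmd = build_llama_command("models/llama-70b-q4_k_m.gguf")
    print(shlex.join(cmd))
```

Building the command twice, once with split_mode="row" and once with split_mode="layer", gives a like-for-like pair to benchmark.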
Journey Context:
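Apple Silicon uses unified memory, so there is no separate 'VRAM': the CPU and GPU share one pool. Users often run with a partial offload such as -ngl 40, which leaves the remaining layers on the CPU and adds latency every time execution hands off between CPU and GPU. For 70B on a 128GB Mac, full GPU offload (-ngl 99) can still exhaust memory or cause bandwidth contention, depending on quantization and context size. The --split-mode row flag (the default is layer) splits matrix multiplication work at the row level rather than assigning whole layers to a device. The claim in this report is that on unified memory Macs row splitting often outperforms layer splitting because it keeps memory access local and parallelizes better. llama.cpp documents the flag as controlling how a model is distributed across multiple GPUs, so its effect on a single Apple GPU is unconfirmed; compare tokens per second against the default before adopting it. Most Mac tutorials omit --split-mode entirely.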
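Whether full offload fits is mostly arithmetic on weight size. The sketch below is a rough estimate only. The bits-per-weight figures are approximations, and the 0.75 GPU-usable fraction and the 8 GiB allowance for KV cache and runtime buffers are assumptions chosen for illustration, not measured values.

```python
GIB = 2 ** 30


def weights_gib(n_params, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def fits_full_offload(n_params, bits_per_weight, ram_gib,
                      gpu_fraction=0.75, overhead_gib=8.0):
    """Rough check: do weights plus overhead fit in the GPU-usable share of RAM?

    gpu_fraction and overhead_gib are assumed values; the real limits depend
    on the macOS version, context length, and other running processes.
    """
    budget_gib = ram_gib * gpu_fraction
    return weights_gib(n_params, bits_per_weight) + overhead_gib <= budget_gib


if __name__ == "__main__":
    n_params = 70e9
    # Approximate effective bits per weight for common formats (assumed).
    for name, bits in [("FP16", 16.0), ("8-bit", 8.5), ("4-bit", 4.5)]:
        size = weights_gib(n_params, bits)
        ok = fits_full_offload(n_params, bits, ram_gib=128)
        print(f"{name}: about {size:.0f} GiB of weights, fits on 128 GiB: {ok}")
```

Under these assumptions an unquantized FP16 70B model does not fit on a 128GB machine, while 8-bit and 4-bit quantizations do, which is why the memory warning above depends on quantization.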
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:22:18.299605+00:00— report_created — created