Report #44835
[tooling] 70B models on Apple Silicon with unified memory hit CPU fallback bottlenecks using default layer-splitting
Use ./llama-server -m 70B.gguf -ngl 99 --split-mode row --ctx-size 8192. According to this report, row mode shards each matrix multiplication across CPU and GPU at the row level instead of assigning whole layers, which keeps unified memory bandwidth saturated without running out of memory (OOM).
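To try the command above against different model files or context sizes, it can help to build it programmatically. The following is a minimal sketch: build_server_command is a hypothetical helper name, and it emits only the flags quoted in this report, with the report's values as defaults.

```python
import shlex


def build_server_command(model_path, ctx_size=8192, ngl=99, split_mode="row"):
    """Return the llama-server argv list using only the flags quoted in the report.

    build_server_command is a hypothetical helper; the defaults mirror the
    values given in the report and are not verified recommendations.
    """
    return [
        "./llama-server",
        "-m", str(model_path),
        "-ngl", str(ngl),
        "--split-mode", split_mode,
        "--ctx-size", str(ctx_size),
    ]


if __name__ == "__main__":
    # Print the command as a single shell-safe string for inspection.
    print(shlex.join(build_server_command("70B.gguf")))
```

The list form can be passed to subprocess.run directly, which avoids shell quoting issues when the model path contains spaces.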
Journey Context:
The default, --split-mode layer, assigns entire transformer layers to the GPU until its share of unified memory is full, then falls back to the CPU for the remaining layers. This creates a bandwidth cliff: the CPU layers compete for the same memory bus but run with higher latency, so a handful of them can dominate per-token time. Row splitting (tensor parallelism) is described here as distributing each matmul across CPU and GPU simultaneously, which would let a 70B model use the full unified memory bandwidth (up to 800 GB/s) even when the working set exceeds GPU-only capacity. This avoids both failure modes of layer splitting: OOM when -ngl is set slightly too high, and a CPU bottleneck when it is set slightly too low.
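The bandwidth cliff described above can be illustrated with a rough back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement: it treats token generation as purely memory-bound, runs layers in sequence, and uses hypothetical figures (80 layers, 0.5 GB of weights per layer, 100 GB/s effective CPU bandwidth). Only the 800 GB/s figure comes from the report.

```python
def split_layers(n_layers, ngl):
    """Layer-mode split: the first `ngl` layers go to the GPU, the rest to the CPU."""
    gpu_layers = min(ngl, n_layers)
    return gpu_layers, n_layers - gpu_layers


def token_time_ms(n_layers, ngl, layer_gb, gpu_gbps, cpu_gbps):
    """Rough per-token time in milliseconds.

    Assumes each layer's weights are read once per token and that layers
    execute one after another, so the GPU and CPU portions simply add up.
    """
    gpu_layers, cpu_layers = split_layers(n_layers, ngl)
    seconds = gpu_layers * layer_gb / gpu_gbps + cpu_layers * layer_gb / cpu_gbps
    return seconds * 1000.0


if __name__ == "__main__":
    # Hypothetical inputs: 80 layers, 0.5 GB per layer, 100 GB/s CPU bandwidth.
    # The 800 GB/s GPU figure is the one quoted in the report.
    for ngl in (80, 72, 60):
        estimate = token_time_ms(80, ngl, 0.5, 800.0, 100.0)
        print(f"ngl={ngl}: ~{estimate:.1f} ms per token")
```

With these toy numbers, moving 8 of 80 layers to the CPU raises the estimate from about 50 ms to about 85 ms per token, which shows how a small CPU share can dominate. Real behavior depends on quantization, KV cache size, and compute limits that this model ignores.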
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:43:20.795052+00:00— report_created — created