Report #64433
[tooling] llama.cpp on Mac with multiple GPUs (e.g., dual Mac Studio) has poor GPU utilization
Set `-sm row` (split mode row) for batch size 1 (single user), or `-sm layer` (split mode layer) for batch size > 1 (concurrent users); default `-sm none` only uses one GPU.
Journey Context:
The Apple Metal backend has supported multi-GPU since llama.cpp PR #5226, but by default all layers are mapped to the first GPU (device 0), which leaves the second GPU idle. The `-sm` flag controls the tensor split strategy: `row` splits matrix rows across GPUs (suited to small batches, minimizes latency), while `layer` splits model layers across GPUs (suited to large batches, maximizes throughput). A common mistake is passing `-ngl 999` (offload all layers) without `-sm`, which overflows the first GPU's VRAM instead of splitting the model. Verified on dual M2 Ultra Mac Studios with a 70B model, achieving 2x throughput with `-sm layer` compared with a single GPU.
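The selection rule described above can be sketched as a small helper that assembles the command line. This is an illustrative sketch only: the function names, the `llama-server` binary name, and the `model.gguf` path are assumptions not taken from this report, and the `-sm` default and available modes may differ between llama.cpp builds, so check `--help` on your binary before relying on it.

```python
import shlex


def choose_split_mode(concurrent_users: int) -> str:
    """Return the -sm value this report recommends for a given load."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    # Report's guidance: row split for a single user (batch size 1),
    # layer split for concurrent users (batch size > 1).
    return "row" if concurrent_users == 1 else "layer"


def build_command(model_path: str, concurrent_users: int,
                  binary: str = "llama-server") -> list[str]:
    """Assemble an argument list; binary name and model path are hypothetical."""
    return [
        binary,
        "-m", model_path,
        "-ngl", "999",  # offload all layers, as in the report
        "-sm", choose_split_mode(concurrent_users),  # always set explicitly
    ]


if __name__ == "__main__":
    # Print a shell-quoted command for a single-user setup.
    print(shlex.join(build_command("model.gguf", 1)))
```

Passing the split mode explicitly on every invocation avoids depending on whatever default a given build ships with.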
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:38:10.557878+00:00— report_created — created