Report #6186
[tooling] Poor multi-GPU scaling with llama.cpp when using the default layer split (`-sm layer`)
Use `-sm row` (split mode row) instead of the default `-sm layer` when launching the llama.cpp server/main binary with multiple GPUs. Row splitting shards the matrix multiplications of every layer across all GPUs, so all cards work on each layer at the same time and their aggregate memory bandwidth is used. Layer splitting is bottlenecked by the bandwidth of whichever card is currently active.
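As a minimal sketch of what such a launch command looks like, the helper below assembles an argument list with `-sm row`. The helper name `build_llama_cmd`, the binary name `llama-server`, the model path, and the even `-ts 1,1` tensor split are illustrative assumptions, not part of the original report; adjust them to match your build and hardware.

```python
import shlex

VALID_SPLIT_MODES = ("none", "layer", "row")


def build_llama_cmd(model_path, split_mode="row", n_gpu_layers=99,
                    tensor_split=None, binary="llama-server"):
    """Assemble a llama.cpp launch command (hypothetical helper).

    binary is an assumption: recent builds ship `llama-server`,
    older builds used `server` / `main`.
    """
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"split_mode must be one of {VALID_SPLIT_MODES}")
    cmd = [
        binary,
        "-m", model_path,           # path to the GGUF model (placeholder)
        "-ngl", str(n_gpu_layers),  # offload layers to the GPUs
        "-sm", split_mode,          # split mode: none, layer, or row
    ]
    if tensor_split is not None:
        # Proportion of the model assigned to each GPU, e.g. 1,1 for an even split
        cmd += ["-ts", ",".join(str(x) for x in tensor_split)]
    return cmd


cmd = build_llama_cmd("models/model-70b-q4.gguf", tensor_split=[1, 1])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` once the binary name and model path match your installation. Comparing a run with `split_mode="layer"` against `split_mode="row"` on the same prompt is the simplest way to check whether row splitting helps on a given setup.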
Journey Context:
The default `-sm layer` assigns entire transformer layers to specific GPUs (e.g., for an 80-layer model, layers 0-39 on GPU0 and layers 40-79 on GPU1). During inference, only the GPU that holds the current layer is active, so memory bandwidth does not scale: throughput is limited to single-card bandwidth. `-sm row` (row splitting) shards each matrix multiplication across the GPUs, so all GPUs work simultaneously on every layer and their bandwidth adds up. This matters most for 70B+ models on dual-GPU setups (e.g., 2x3090), where layer splitting reportedly gives about 50% utilization per card while row splitting gives 90%+. The tradeoff is somewhat higher PCIe synchronization overhead, which is negligible for large batches.
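The bandwidth argument can be put into rough numbers. The sketch below is a back-of-envelope model, not a benchmark: it assumes token generation is purely memory-bandwidth bound, so each token requires reading all weights once. The function name, the 40 GB model size (roughly a 4-bit 70B model), and the `sync_efficiency` factor are illustrative assumptions; 936 GB/s is the published memory bandwidth of an RTX 3090.

```python
def decode_tokens_per_sec(model_gb, gpu_bw_gb_s, n_gpus, split_mode,
                          sync_efficiency=1.0):
    """Upper-bound tokens/s for bandwidth-bound decoding (hypothetical model).

    Ignores KV-cache reads, compute time, and PCIe latency except for the
    lumped sync_efficiency factor, which is an assumed value.
    """
    if split_mode == "layer":
        # Layers run one after another; only one GPU streams weights at a time,
        # so the effective bandwidth is that of a single card.
        effective_bw = gpu_bw_gb_s
    elif split_mode == "row":
        # Every GPU streams its shard of each layer concurrently.
        effective_bw = gpu_bw_gb_s * n_gpus * sync_efficiency
    else:
        raise ValueError("split_mode must be 'layer' or 'row'")
    return effective_bw / model_gb


MODEL_GB = 40.0      # assumed size of a 4-bit 70B model
BW_3090 = 936.0      # RTX 3090 memory bandwidth in GB/s

layer_tps = decode_tokens_per_sec(MODEL_GB, BW_3090, 2, "layer")
row_tps = decode_tokens_per_sec(MODEL_GB, BW_3090, 2, "row", sync_efficiency=0.9)
print(f"layer split: ~{layer_tps:.1f} tok/s upper bound")
print(f"row split:   ~{row_tps:.1f} tok/s upper bound")
```

Under these assumptions the ceiling for row splitting is close to twice that of layer splitting on two cards. Real throughput will be lower in both modes, and the gap depends on how much the interconnect costs on a given machine.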
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:19:15.766102+00:00— report_created — created