Report #17112
[tooling] Multi-GPU inference slower than single GPU due to PCIe bottlenecks
Use --split-mode row in llama.cpp for tensor-parallel splitting across PCIe-connected GPUs instead of layer-splitting, reducing inter-GPU communication volume.
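Since the effect depends on hardware, it may help to measure both modes before settling on one. The sketch below only builds the command lines; it assumes llama.cpp's `llama-bench` tool is on the PATH and built with a multi-GPU backend. The helper name `build_bench_cmd`, the model path, and the `-ngl 99` value are illustrative assumptions, not part of the report.

```python
import shlex

# Split modes accepted by llama.cpp's --split-mode / -sm option.
VALID_SPLIT_MODES = ("none", "layer", "row")


def build_bench_cmd(model_path, split_mode, n_gpu_layers=99, binary="llama-bench"):
    """Return an argument list for one benchmark run (hypothetical helper)."""
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"unknown split mode: {split_mode!r}")
    return [
        binary,
        "-m", model_path,            # path to the GGUF model (placeholder below)
        "-ngl", str(n_gpu_layers),   # offload layers to the GPUs
        "-sm", split_mode,           # layer = default, row = tensor-parallel split
    ]


# Placeholder model path; substitute your own file.
commands = {
    mode: build_bench_cmd("models/your-70b-model.gguf", mode)
    for mode in ("layer", "row")
}

for mode, cmd in commands.items():
    # Print the commands so they can be reviewed before being run by hand.
    print(f"{mode}: {shlex.join(cmd)}")
```

Printing rather than executing keeps the sketch safe to run anywhere; the resulting lines can be pasted into a shell and the reported tokens per second compared across the two modes.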
Journey Context:
Default layer-splitting (--split-mode layer) assigns whole transformer layers to specific GPUs. For a 70B model on 2x24GB GPUs, this means activations cross the PCIe bus whenever execution passes from one GPU's layers to the other's, which this report identifies as the bottleneck. Row-splitting (--split-mode row) splits the weight matrices by rows (tensor parallelism), so every layer is present on both GPUs and each GPU computes part of it. According to this report, that reduces the data transferred per token from O(hidden_size) to O(hidden_size/num_gpus) for the MLP and attention projections, and is essential for PCIe 4.0 x16 setups without NVLink.
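As a worked illustration of the scaling stated above, the sketch below takes the report's O(hidden_size) versus O(hidden_size/num_gpus) figures at face value and converts them to bytes. The hidden size of 8192 and the 2-byte (fp16) activation width are assumptions chosen for a 70B-class model, and the function name is hypothetical. It does not model transfer latency or synchronization overhead, which may matter more than raw volume in practice.

```python
def per_transfer_bytes(hidden_size, num_gpus, bytes_per_value=2):
    """Activation bytes per token per inter-GPU transfer, using the report's scaling.

    layer: O(hidden_size) values cross the bus.
    row:   O(hidden_size / num_gpus) values cross the bus.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "layer": hidden_size * bytes_per_value,
        "row": (hidden_size // num_gpus) * bytes_per_value,
    }


# Assumed values: hidden_size=8192, two GPUs, fp16 activations.
estimate = per_transfer_bytes(hidden_size=8192, num_gpus=2)
print(estimate)  # {'layer': 16384, 'row': 8192}
```

Under these assumptions the per-transfer volume halves with two GPUs; treat the numbers as a way to reason about the claim, not as a measurement.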
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:26:23.851467+00:00 - report_created - created