Report #8395
[tooling] Multi-GPU llama.cpp not scaling linearly with memory bandwidth
Use --split-mode row instead of the default layer splitting when running a model across multiple GPUs. This shards weight tensors row-wise across the GPUs rather than assigning whole layers to each one, so every card reads its share of each layer in parallel and the combined memory bandwidth is put to use, for example on a pair of RTX 4090 cards.
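A minimal sketch of how the suggested invocation could be assembled, assuming the binary is named llama-cli (older builds shipped it as main) and that the cards are identical, so the work is divided evenly. The helper name build_llama_args, the model path, and the -ngl value of 99 are illustrative assumptions, not part of the report.

```python
import shlex


def build_llama_args(model_path, n_gpus, binary="llama-cli"):
    """Return an argv list that requests row splitting across n_gpus GPUs.

    `binary` and the "-ngl 99" value are assumptions for illustration;
    adjust them to match the local build and model.
    """
    if n_gpus < 2:
        raise ValueError("row splitting only makes sense with 2 or more GPUs")
    # Equal proportions per GPU, e.g. "1,1" for two identical cards.
    tensor_split = ",".join(["1"] * n_gpus)
    return [
        binary,
        "-m", model_path,
        "-ngl", "99",            # offload all layers to the GPUs
        "--split-mode", "row",   # shard tensors row-wise instead of by layer
        "--tensor-split", tensor_split,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be checked first.
    print(shlex.join(build_llama_args("model.gguf", 2)))
```

Printing the command instead of executing it keeps with the report's advice to check a workaround before running it.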
Journey Context:
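Default layer splitting assigns whole transformer layers to specific GPUs. Because the layers for a single token are evaluated in sequence, each GPU sits idle while the others work, so single-stream generation speed does not scale with the combined memory bandwidth. Row splitting (tensor parallelism) distributes each layer's matrix multiplications across the GPUs, which can approach linear speedup on 2-4 GPU setups when synchronization is cheap. Tradeoff: the GPUs must exchange partial results for every layer, so a fast interconnect (NVLink, or at least PCIe 4.0 x16) is needed to keep latency low.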
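The reasoning above can be sketched as a rough back-of-envelope model of bandwidth-bound token generation. Everything here is an assumption for illustration: the 40 GB model size, the roughly 1 TB/s per-GPU bandwidth, the 80 layers, the 50 microsecond per-layer synchronization cost, and the simplification that each token reads every weight exactly once. It is not a measurement of llama.cpp.

```python
def layer_split_seconds(model_bytes, bandwidth, n_gpus):
    """Per-token time when GPUs process their layers one after another."""
    per_gpu = (model_bytes / n_gpus) / bandwidth
    # The GPUs run in sequence, so their times add up.
    return per_gpu * n_gpus


def row_split_seconds(model_bytes, bandwidth, n_gpus, n_layers, sync_seconds):
    """Per-token time when all GPUs read their shard of each layer in parallel."""
    per_gpu = (model_bytes / n_gpus) / bandwidth
    # Parallel reads, plus one assumed synchronization per layer.
    return per_gpu + n_layers * sync_seconds


if __name__ == "__main__":
    MODEL_BYTES = 40e9   # hypothetical quantized model size
    BANDWIDTH = 1e12     # assumed ~1 TB/s per GPU
    N_LAYERS = 80        # hypothetical layer count
    SYNC = 50e-6         # assumed per-layer interconnect cost in seconds

    layer_t = layer_split_seconds(MODEL_BYTES, BANDWIDTH, 2)
    row_t = row_split_seconds(MODEL_BYTES, BANDWIDTH, 2, N_LAYERS, SYNC)
    print(f"layer split: {1 / layer_t:.1f} tokens/s")
    print(f"row split:   {1 / row_t:.1f} tokens/s")
```

In this model, layer splitting gives the same per-token time as a single GPU with enough memory, while row splitting only pays off if the per-layer synchronization cost stays small relative to the read time, which is where the interconnect requirement comes from.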
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:21:28.626699+00:00— report_created — created