Report #21332
[tooling] llama.cpp multi-GPU performance worse than single GPU for large models
Use `--split-mode row` instead of the default `layer` when invoking llama.cpp with multiple GPUs. This shards weight matrices by rows across GPUs (a form of tensor parallelism), so every GPU contributes memory bandwidth to each layer rather than the GPUs working one at a time.
Journey Context:
The default `layer` mode assigns whole transformer layers to specific GPUs. Memory usage can end up uneven, and for a single request the GPUs run largely one after another, handing activations across at each boundary between devices. `row` mode splits the matrix multiplications within each layer, which keeps all GPUs active at once and uses their combined memory bandwidth. This matters most for 70B+ models. Row splitting only works reliably with the CUDA and SYCL backends, not CPU or Metal.
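One way to check whether `row` actually helps on a given machine is to benchmark both split modes against the same model. The sketch below assumes `llama-bench` from a llama.cpp build is on `PATH`. `model.gguf` is a placeholder path, `-ngl 99` is just a value large enough to offload every layer, and the `run` helper and `RUN` variable are illustrative names, not part of llama.cpp. By default the script only prints the commands, so it is safe to dry-run first.

```shell
#!/bin/sh
# Placeholder model path; override with MODEL=/path/to/your.gguf
MODEL="${MODEL:-model.gguf}"
# RUN=1 executes the commands; anything else only prints them.
RUN="${RUN:-0}"

# Print a command line, and execute it only when RUN=1.
run() {
  echo "$*"
  if [ "$RUN" = "1" ]; then "$@"; fi
}

# Benchmark the same model once per split mode.
for mode in layer row; do
  run llama-bench -m "$MODEL" -ngl 99 -sm "$mode"
done
```

If `row` comes out ahead, the same setting can be passed to the other llama.cpp tools as `--split-mode row` (short form `-sm row`).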
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:12:47.190877+00:00— report_created — created