Report #1026
[tooling] llama.cpp on two GPUs is slower than a single GPU or runs out of VRAM
Use the default --split-mode layer (pipeline parallelism) across consumer PCIe GPUs. Switch to --split-mode tensor only when you have a fast interconnect (NVLink with NCCL) and a dense model; in that case, disable auto-fit (--fit off) and set --ctx-size manually. Avoid the deprecated row split.
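As a rough illustration of the recommendation above, the helper below assembles a command line for either case. It is only a sketch: build_llama_args and its parameters are hypothetical names, llama-server stands in for whichever llama.cpp binary you run, and the tensor and --fit off values are copied from this report rather than checked against a particular build.

```python
import shlex


def build_llama_args(model_path, ctx_size=None, fast_interconnect=False, dense_model=True):
    """Return an argv list for a multi-GPU llama.cpp run (hypothetical helper)."""
    args = ["llama-server", "-m", model_path]
    if fast_interconnect and dense_model:
        # Tensor split: per this report, auto-fit is unsupported, so the
        # context size has to be chosen by hand.
        if ctx_size is None:
            raise ValueError("tensor split needs an explicit ctx_size because auto-fit is off")
        # Flag values taken from the report as-is; verify against --help on your build.
        args += ["--split-mode", "tensor", "--fit", "off", "--ctx-size", str(ctx_size)]
    else:
        # Default recommendation for consumer PCIe GPUs.
        args += ["--split-mode", "layer"]
        if ctx_size is not None:
            args += ["--ctx-size", str(ctx_size)]
    return args


# Print both variants as shell-quoted command lines.
print(shlex.join(build_llama_args("model.gguf")))
print(shlex.join(build_llama_args("model.gguf", ctx_size=8192, fast_interconnect=True)))
```

Raising an error when ctx_size is missing in the tensor case is a design choice of this sketch; it makes the "set --ctx-size manually" step hard to forget.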
Journey Context:
Layer split assigns a contiguous block of layers to each GPU and passes only a hidden-state vector across the boundary between GPUs, so it needs little inter-GPU bandwidth and suits the common case of adding GPUs to expand available memory. Tensor split parallelizes the work inside each layer and can reduce token-generation latency, but it all-reduces full activation tensors at every layer and is therefore interconnect-bound; without NCCL it is often net-negative. Tensor split also does not support auto-fit or quantized KV caches. Row split is deprecated and superseded by tensor split.
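To get a feel for the bandwidth difference described above, here is a back-of-the-envelope estimate of per-token inter-GPU payload. Everything in it is an assumption for illustration: the function name, the example model shape (hidden size 4096, 32 layers, 2-byte activations), and one all-reduce per layer. It counts payload bytes only and ignores all-reduce algorithm overhead and per-transfer latency, which may matter at least as much during token generation.

```python
def per_token_traffic_bytes(hidden_size, n_layers, bytes_per_elem=2,
                            reduces_per_layer=1, n_gpus=2):
    """Estimate inter-GPU payload per generated token (illustrative only)."""
    # Layer split: one hidden-state vector crosses each GPU-to-GPU boundary.
    layer_split = (n_gpus - 1) * hidden_size * bytes_per_elem
    # Tensor split: an activation-sized all-reduce at every layer
    # (reduces_per_layer is an assumption, not a measured value).
    tensor_split = n_layers * reduces_per_layer * hidden_size * bytes_per_elem
    return layer_split, tensor_split


# Hypothetical model shape, chosen only to make the ratio concrete.
layer_b, tensor_b = per_token_traffic_bytes(hidden_size=4096, n_layers=32)
print(f"layer split:  {layer_b} bytes per token")
print(f"tensor split: {tensor_b} bytes per token")
print(f"ratio: {tensor_b // layer_b}x")
```

Under these assumptions the tensor-split payload grows with the number of layers while the layer-split payload grows only with the number of GPU boundaries, which is consistent with tensor split being the interconnect-bound mode.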
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:53:43.372230+00:00— report_created — created