Report #8016
[tooling] Tensor splitting across multiple GPUs with llama.cpp gives slower inference than single GPU
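Use `--tensor-split` with proportions chosen to balance VRAM across the cards (e.g., `0.6,0.4`). When NVLink is absent, also disable peer-to-peer copies so that transfers between GPUs are staged through host memory instead of taking a slow P2P path over PCIe. Note that `GGML_CUDA_NO_PEER_COPY` is a build-time CMake option (`-DGGML_CUDA_NO_PEER_COPY=ON`), not a runtime environment variable, so this requires rebuilding llama.cpp.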
Journey Context:
Users with two 24GB GPUs (e.g., RTX 3090s) try to run 70B models by splitting tensors across both cards. By default, llama.cpp's CUDA backend tries to use NVIDIA peer-to-peer (P2P) access between the GPUs, which goes over PCIe when NVLink isn't present. On consumer cards without NVLink, this can fall back to a slow PCIe path and badly degrade performance, so the user sees lower speed than running on a single GPU with CPU offload. The fix is two-fold. First, choose `--tensor-split` proportions that balance VRAM: the values are relative weights, not layer indices, so `--tensor-split 0.6,0.4` on an 81-layer offload places roughly layers 0-48 on GPU 0 and 49-80 on GPU 1. Second, and crucially, if no NVLink exists, disable P2P copies so that data moves between GPUs via host RAM. In recent llama.cpp source trees this is the CMake option `GGML_CUDA_NO_PEER_COPY` (configure with `-DGGML_CUDA_NO_PEER_COPY=ON` and rebuild); it is a compile-time setting, so exporting `GGML_CUDA_NO_PEER_COPY=1` in the shell has no effect on an existing binary. Avoiding the slow P2P fallback often gives better throughput than broken P2P on consumer PCIe switches.
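The split calculation described above can be sketched as follows. This is a minimal illustration, not llama.cpp's own logic: the helper names (`split_ratios`, `layers_per_gpu`, `tensor_split_arg`) and the `reserve_gib` headroom are hypothetical, and the largest-remainder rounding only approximates how llama.cpp assigns layers, so check the actual placement in the load log.

```python
def split_ratios(free_vram_gib, reserve_gib=1.5):
    """Turn per-GPU free VRAM (GiB) into normalized split proportions.

    reserve_gib is an assumed headroom per GPU for KV cache and scratch
    buffers; tune it for your context size.
    """
    usable = [max(v - reserve_gib, 0.0) for v in free_vram_gib]
    total = sum(usable)
    if total <= 0:
        raise ValueError("no usable VRAM left after the reserve")
    return [u / total for u in usable]


def layers_per_gpu(n_layers, ratios):
    """Estimate how many layers each GPU receives for the given proportions.

    Uses largest-remainder rounding so the counts always sum to n_layers.
    """
    total = sum(ratios)
    exact = [n_layers * r / total for r in ratios]
    counts = [int(e) for e in exact]
    leftover = n_layers - sum(counts)
    # Hand the remaining layers to the GPUs with the largest fractional part.
    by_fraction = sorted(range(len(ratios)),
                         key=lambda i: exact[i] - counts[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


def tensor_split_arg(ratios):
    """Format proportions as the comma-separated value for --tensor-split."""
    return ",".join(f"{r:.2f}" for r in ratios)


if __name__ == "__main__":
    ratios = [0.6, 0.4]
    print("--tensor-split", tensor_split_arg(ratios))
    print("estimated layers per GPU:", layers_per_gpu(81, ratios))
```

If one card also drives a display or holds other allocations, pass its actual free memory to `split_ratios` rather than the nominal 24 GiB, so the resulting proportions lean toward the emptier GPU.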
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T04:19:33.530685+00:00— report_created — created