Report #51443
[tooling] Poor latency with multi-GPU setup using llama.cpp; adding second GPU barely improves token generation speed for batch=1
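Use row split mode (`-sm row`, long form `--split-mode row`) instead of the default layer split (`-sm layer`). Row split divides each layer's weight matrices across the GPUs, a form of tensor parallelism, so both GPUs compute on the same token at once. Layer split runs the GPUs one after another for a single token, so it mainly helps fit the model in memory or raise throughput at larger batch sizes. Note that `-ts 0.5,0.5` (`--tensor-split`) only sets the proportion of the model placed on each GPU and does not by itself enable tensor parallelism, and `-np 2` (`--parallel`) sets the number of parallel sequences, not pipeline parallelism.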
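As a rough illustration of the flags involved, the sketch below builds two command lines that differ only in split mode, so the two can be benchmarked against each other. It assumes llama.cpp's `--split-mode` (`-sm`) option with the values `none`, `layer`, and `row`. The helper name `llama_server_cmd`, the model path `model.gguf`, and `-ngl 99` (offload all layers) are placeholders chosen for the example, not values from this report.

```python
import shlex

VALID_SPLIT_MODES = ("none", "layer", "row")

def llama_server_cmd(model_path, split_mode, tensor_split=(0.5, 0.5)):
    """Build an argv list for llama-server (hypothetical helper).

    split_mode   -> -sm: HOW the model is divided across GPUs
    tensor_split -> -ts: WHAT PROPORTION of the model each GPU receives
    """
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"unknown split mode: {split_mode}")
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",  # assumption: offload every layer to the GPUs
        "-sm", split_mode,
        "-ts", ",".join(str(p) for p in tensor_split),
    ]

# Same proportions in both runs; only the split mode changes.
baseline = llama_server_cmd("model.gguf", "layer")
candidate = llama_server_cmd("model.gguf", "row")

print(shlex.join(baseline))
print(shlex.join(candidate))
```

Keeping `-ts` identical in both commands isolates the effect of `-sm`, which is the setting this report is actually about.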
Journey Context:
Users usually run with the default layer split (`-sm layer`), which places contiguous blocks of layers on each GPU (for an 80-layer model, layers 0-39 on GPU0 and 40-79 on GPU1). For batch=1 (chat), each token must pass through GPU0 and then GPU1, so one GPU sits idle while the other works and per-token latency is roughly that of a single GPU. Row split (`-sm row`) divides each layer's weight matrices across the GPUs, so both GPUs work on the same token simultaneously, which can reduce latency for batch=1. Tradeoff: row split synchronizes partial results between GPUs within every layer, so it needs high inter-GPU bandwidth (NVLink is ideal); on PCIe the gain may be small or negative, so benchmark both modes on the target hardware. Users miss `-sm` because `-ts` and `-np` are easily mistaken for the parallelism switches: `-ts` only sets the per-GPU proportions, and `-np` sets the number of parallel sequences.
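The reasoning above can be made concrete with a toy latency model. This is a simplification for intuition only, not a measurement of llama.cpp: it assumes that compute divides evenly across GPUs under row split, that each layer pays one fixed synchronization cost, and that layer split pays one fixed transfer per GPU boundary. All timings are hypothetical, in microseconds.

```python
def layer_split_latency_us(t_single_gpu_us, n_gpus, t_transfer_us):
    """Per-token latency when GPUs process their layer blocks one after another.

    The compute is serial, so it totals the single-GPU time, plus one
    activation transfer at each boundary between consecutive GPUs.
    """
    return t_single_gpu_us + (n_gpus - 1) * t_transfer_us

def row_split_latency_us(t_single_gpu_us, n_gpus, n_layers, t_sync_us):
    """Per-token latency when every layer is divided across all GPUs.

    Compute shrinks by the GPU count, but each layer adds a sync cost.
    """
    return t_single_gpu_us / n_gpus + n_layers * t_sync_us

# Hypothetical numbers: 40 ms per token on one GPU, 80 layers, 2 GPUs.
T_SINGLE_US, N_GPUS, N_LAYERS = 40_000, 2, 80

layer_split = layer_split_latency_us(T_SINGLE_US, N_GPUS, t_transfer_us=100)
row_fast_link = row_split_latency_us(T_SINGLE_US, N_GPUS, N_LAYERS, t_sync_us=50)
row_slow_link = row_split_latency_us(T_SINGLE_US, N_GPUS, N_LAYERS, t_sync_us=300)

print(f"layer split:           {layer_split:.0f} us")
print(f"row split, fast link:  {row_fast_link:.0f} us")
print(f"row split, slow link:  {row_slow_link:.0f} us")
```

With these made-up inputs, row split wins clearly when synchronization is cheap and loses once the per-layer sync cost grows, which is why measuring both modes on the actual interconnect matters more than any rule of thumb.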
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:50:11.295354+00:00— report_created — created