Report #62476
[tooling] Multi-GPU inference fails or underperforms because I don't have NVLink for peer-to-peer transfers
Use llama.cpp's --tensor-split flag with explicit ratios (e.g., --tensor-split 20,12) to distribute the model across heterogeneous or PCIe-connected GPUs. The values are relative proportions, one per GPU, so 20,12 places 62.5% of the model on the first GPU and 37.5% on the second. Accept that cross-GPU traffic then travels over PCIe instead of NVLink.
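As a rough illustration of the flag usage described above, here is a minimal Python sketch that assembles a llama.cpp command line with an explicit split. The helper name build_llama_command, the binary name llama-cli, the model path model.gguf, and the -ngl value of 99 are assumptions for illustration and do not come from this report; check the flags against your own llama.cpp build before running anything.

```python
import shlex


def build_llama_command(model_path, split, n_gpu_layers=99, binary="llama-cli"):
    """Return an argv list that passes explicit per-GPU proportions to llama.cpp.

    Hypothetical helper: the binary name and default layer count are assumptions.
    """
    # One non-negative proportion per GPU; at least one must be positive.
    if not split or any(s < 0 for s in split) or sum(split) <= 0:
        raise ValueError("split needs one non-negative proportion per GPU")
    ratio = ",".join(str(s) for s in split)
    return [
        binary,
        "-m", model_path,            # model file (placeholder path)
        "-ngl", str(n_gpu_layers),   # number of layers to offload to GPU
        "--tensor-split", ratio,     # relative share of the model per GPU
    ]


cmd = build_llama_command("model.gguf", [20, 12])
print(shlex.join(cmd))
# prints: llama-cli -m model.gguf -ngl 99 --tensor-split 20,12
```

Building the arguments as a list rather than a single string keeps quoting safe if the list is later handed to subprocess.run.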
Journey Context:
A common assumption is that NVLink is mandatory for multi-GPU inference, but llama.cpp can split a model across GPUs over standard PCIe. Without --tensor-split, llama.cpp chooses the split itself, and when that split does not match the memory actually usable on each card, it can run out of memory (OOM) on one GPU or underutilize the other. Explicit ratios override the automatic split on mixed-VRAM setups (e.g., 24GB + 12GB). The tradeoff is slower inter-GPU transfers over PCIe in exchange for more total memory capacity; for attention layers this is usually acceptable compared with falling back to CPU offload.
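For the mixed-VRAM case mentioned above (24GB + 12GB), one possible starting point is to make the split proportional to the memory each card can actually spare. The sketch below assumes exactly that heuristic; the helper name split_from_vram and the reserve_gib headroom parameter are hypothetical, and a proportional split is only a first guess that may need tuning for your model and context size.

```python
from functools import reduce
from math import gcd


def split_from_vram(vram_gib, reserve_gib=0):
    """Derive a --tensor-split string proportional to usable VRAM per GPU.

    Hypothetical helper: vram_gib is a list of whole-GiB capacities, and
    reserve_gib is headroom held back on every card (an assumption, e.g.
    for the desktop or the KV cache).
    """
    usable = [v - reserve_gib for v in vram_gib]
    if not usable or any(u <= 0 for u in usable):
        raise ValueError("every GPU needs usable memory after the reserve")
    # Reduce to the smallest whole-number ratio for readability.
    divisor = reduce(gcd, usable)
    return ",".join(str(u // divisor) for u in usable)


print(split_from_vram([24, 12]))                 # prints: 2,1
print(split_from_vram([24, 12], reserve_gib=2))  # prints: 11,5
```

The resulting string can be passed directly as the value of --tensor-split; if one GPU still runs out of memory, shifting the ratio toward the larger card is the obvious next adjustment.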
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:21:05.591120+00:00— report_created — created