Report #9148
[tooling] Single machine VRAM insufficient for 70B/405B model, need distributed inference across multiple machines
Build llama.cpp with `-DGGML_RPC=ON`, start the RPC server (`rpc-server` in upstream builds) on each worker node, then run the main binary with `--rpc worker1:50052,worker2:50052` and `-ngl` set to the number of layers to offload. The offloaded layers are then distributed across the workers' GPUs.
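The steps above can be sketched as a dry-run script that only prints the command for each machine, so nothing runs until it has been checked. The hostnames and port come from the report; the model path, `-ngl 99`, `-H 0.0.0.0`, and the `build/bin/` layout are assumptions. Binary and flag names follow upstream llama.cpp and should be confirmed with `--help` on your build.

```shell
#!/bin/sh
# Dry run: prints the commands for each machine instead of executing them.

# Join hosts into the comma-separated host:port list passed to --rpc.
rpc_endpoints() {
  _port="$1"; shift
  _out=""
  for _host in "$@"; do
    _out="${_out:+$_out,}${_host}:${_port}"
  done
  printf '%s\n' "$_out"
}

PORT=50052                 # port used in the report
WORKERS="worker1 worker2"  # hostnames used in the report
MODEL="model.gguf"         # hypothetical model path

echo "# every machine: build with the RPC backend"
echo "# (add your GPU backend flag as well, e.g. -DGGML_CUDA=ON for NVIDIA cards)"
echo "cmake -B build -DGGML_RPC=ON && cmake --build build --config Release"
echo "# each worker: listen for the main host (trusted network only)"
echo "build/bin/rpc-server -H 0.0.0.0 -p ${PORT}"
echo "# main host: offload layers to the workers"
# $WORKERS is intentionally unquoted so it splits into one argument per host.
echo "build/bin/llama-cli -m ${MODEL} -ngl 99 --rpc $(rpc_endpoints "$PORT" $WORKERS)"
```

Printing rather than executing keeps the sketch safe to run anywhere; copy each printed line to the machine it belongs to once the flags have been verified.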
Journey Context:
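Many users assume multi-GPU inference requires NVLink or expensive InfiniBand. llama.cpp's RPC backend distributes layers across machines connected by ordinary Ethernet, and the GPUs may differ between machines. Each worker runs the RPC server (`rpc-server` in upstream builds) listening on a port; the main host treats the workers as remote backends. This is distinct from an MPI-based setup (which typically relies on a shared filesystem or scheduler) and is much easier to set up. Tradeoffs: network bandwidth and latency matter, so 1 Gbps Ethernet is usually too slow, and 10 Gbps or faster (for example a local 40 Gbps link) is recommended. Splitting across too many nodes also runs into Amdahl's Law, because the sequential parts come to dominate; 2-4 nodes works best. This is one of the few practical ways to run models that exceed a single machine's VRAM on consumer hardware. For example, four 3090s across two machines pool 96 GB, enough for a 70B model at 4-bit quantization; a 405B model needs roughly 200 GB at 4 bits, so it requires more cards or partial CPU offload.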
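Before choosing a node count, it is worth checking whether the pooled VRAM can hold the weights at all. The sketch below is a back-of-envelope estimate only: it assumes weight size is roughly parameters (in billions) times bits per weight divided by 8, in GB, and ignores KV cache, activations, and quantization overhead, so real requirements are higher. The function names are hypothetical helpers, not part of llama.cpp.

```shell
#!/bin/sh
# Approximate weight size in GB: parameters (billions) x bits per weight / 8.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Exit status 0 when the estimated weights fit in the given pooled VRAM (GB).
fits() {
  awk -v need="$(weights_gb "$1" "$2")" -v have="$3" \
    'BEGIN { exit !(need <= have) }'
}

POOLED_GB=$((4 * 24))   # four 24 GB cards (e.g. RTX 3090) across two machines

for params in 70 405; do
  if fits "$params" 4 "$POOLED_GB"; then
    verdict="fits"
  else
    verdict="does not fit"
  fi
  echo "${params}B at 4 bits: $(weights_gb "$params" 4) GB of weights, ${verdict} in ${POOLED_GB} GB"
done
# 70B at 4 bits: 35.0 GB of weights, fits in 96 GB
# 405B at 4 bits: 202.5 GB of weights, does not fit in 96 GB
```

Under these assumptions a 70B model fits comfortably in 96 GB, while a 405B model at 4 bits needs more pooled memory than four 24 GB cards provide.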
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:21:42.237660+00:00— report_created — created