Report #65563
[tooling] Running a 70B model on a single 48GB GPU runs out of VRAM, and CPU offload is too slow
Use llama.cpp's RPC backend: start `rpc-server` on each worker node, bound to an address the client can reach, then run `llama-server --rpc 192.168.1.10:50052,192.168.1.11:50052 -m model.gguf -ngl 999` on the client to distribute layers across networked GPUs.
Journey Context:
Most users assume multi-GPU inference requires NVLink or a single machine with multiple PCIe slots. The RPC backend (added in 2024) lets llama.cpp treat remote GPUs as additional compute devices over a plain TCP connection; it uses its own socket protocol rather than gRPC. Critical implementation detail: llama.cpp must be built with RPC support on both the client and the workers (`-DGGML_RPC=ON` in current trees; older trees used `-DLLAMA_RPC=ON`), and the worker binary (`rpc-server` in current builds) must already be running with the correct `--host` and `--port` before the client connects. Latency is tolerable for 70B+ models because each layer's computation is large relative to the data sent over the network, but for small models the RPC overhead dominates and the setup ends up slower than running locally.
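The start-up order described above can be sketched as a small shell helper. This is a minimal sketch, not a verified procedure: the function names (`join_endpoints`, `wait_for_endpoint`, `client_command`) are hypothetical, the worker addresses and `model.gguf` are the ones from this report, and the readiness probe assumes `nc` (netcat) with `-z` support is installed. The build and worker commands appear only as comments because their exact flag and binary names depend on the llama.cpp version in use.

```shell
#!/bin/sh
# Sketch: confirm every RPC worker is listening, then assemble the client command.
#
# On every machine (flag name depends on the llama.cpp version):
#   cmake -B build -DGGML_RPC=ON      # older trees: -DLLAMA_RPC=ON
#   cmake --build build --config Release
# On each worker, before starting the client:
#   build/bin/rpc-server --host 0.0.0.0 --port 50052

# Join host:port arguments into the comma-separated list that --rpc expects.
join_endpoints() (
  IFS=,
  printf '%s\n' "$*"
)

# Return 0 once host:port accepts a TCP connection, 1 after N failed attempts.
# Usage: wait_for_endpoint HOST:PORT [ATTEMPTS]   (default 10 attempts)
wait_for_endpoint() (
  host=${1%:*}
  port=${1##*:}
  tries=${2:-10}
  while [ "$tries" -gt 0 ]; do
    # Assumption: nc is available; -z probes without sending data.
    nc -z -w 2 "$host" "$port" 2>/dev/null && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
)

# Print (not run) the client command line for the given endpoints.
client_command() {
  printf 'llama-server --rpc %s -m model.gguf -ngl 999\n' "$(join_endpoints "$@")"
}
```

A possible way to use it: call `wait_for_endpoint 192.168.1.10:50052` and `wait_for_endpoint 192.168.1.11:50052`, stop if either returns non-zero, and only then run the line printed by `client_command 192.168.1.10:50052 192.168.1.11:50052`. Printing the command rather than executing it keeps the sketch safe to try on a machine without llama.cpp installed.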
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:31:40.106138+00:00— report_created — created