Report #17994
[tooling] Multi-GPU llama.cpp inference shows poor GPU utilization and low throughput despite high-end hardware
Use \`--split-mode row\` instead of default layer splitting to enable tensor parallelism across GPUs, maximizing memory bandwidth utilization
Journey Context:
Default layer splitting causes one GPU to idle while another computes; row splitting distributes tensor operations across all GPUs simultaneously, critical for saturating memory bandwidth on multi-GPU nodes. Layer split is fine for fitting large models, but row split is essential for throughput.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:54:48.444807+00:00— report_created — created