Report #42855
[tooling] Suboptimal throughput or out-of-memory errors when distributing large models across multiple GPUs with different VRAM capacities
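Use `llama-bench -m model.gguf -ngl 999 --tensor-split 3/2` to test layer distribution across GPUs empirically before deploying the server. The values are relative proportions per GPU (not raw MB, and they need not sum to 100); `llama-bench` separates devices with `/` and uses commas to list several splits to compare in one run, e.g. `--tensor-split 3/2,2/1`. Pick the split that maximizes tokens/sec without running out of memory on the smallest GPU.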
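A sweep over several splits can be generated rather than typed by hand. The following is a minimal Python sketch, assuming `llama-bench` takes `/` between devices and `,` between candidate splits for `-ts`, and supports `-o json`. The helper names `candidate_splits` and `bench_command` are hypothetical, and `model.gguf` is a placeholder path.

```python
import shlex


def candidate_splits(vram_gb, step=1, span=2):
    """Return split candidates around a VRAM-proportional starting point.

    Only the first two GPUs are varied: weight is shifted between GPU0 and
    GPU1 in increments of `step`. Illustrative helper, not part of llama.cpp.
    """
    base = [int(v) for v in vram_gb]
    candidates = []
    for delta in range(-span, span + 1):
        weights = list(base)
        weights[0] = base[0] + delta * step
        weights[1] = base[1] - delta * step
        if min(weights) > 0:  # skip splits that would starve a GPU entirely
            candidates.append("/".join(str(w) for w in weights))
    return candidates


def bench_command(model, splits, ngl=999):
    """Build one llama-bench invocation that tests every candidate split."""
    return ["llama-bench", "-m", model, "-ngl", str(ngl),
            "-ts", ",".join(splits), "-o", "json"]


if __name__ == "__main__":
    # Example: a 24GB card paired with a 16GB card.
    print(shlex.join(bench_command("model.gguf", candidate_splits([24, 16]))))
```

Starting from the VRAM ratio and nudging weight in both directions keeps the sweep small; the reported tokens/sec per split then shows which direction helps.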
Journey Context:
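When a model is larger than a single GPU's VRAM (e.g., a 70B model on two 24GB cards), llama.cpp must split layers across devices. If no split is given, llama.cpp picks one itself (recent builds divide layers in proportion to each device's free memory rather than evenly), but that default does not account for different bus speeds (PCIe x16 vs x4) and can leave too little room for the KV cache and compute buffers, so asymmetric setups (e.g., 24GB + 16GB) can still hit OOM or bottlenecks. The `--tensor-split` flag takes one relative weight per GPU; the values are proportions, so `3,1` and `75,25` describe the same split. `llama-server` and `llama-cli` separate the weights with commas, while `llama-bench` separates them with `/` because it uses commas to list several configurations in one run. `llama-bench` allows rapid testing of different splits without the overhead of the full server. Common mistake: assuming `-ngl 999` (offload all layers) is enough and leaving the split to llama.cpp's default, which often fails on heterogeneous setups. Another is misreading the positions in the split: entries map to GPUs in device order (index 0 is GPU0, etc.), and layers that stay on the CPU because of a lower `-ngl` value do not get an entry of their own. Use `nvidia-smi` or `rocm-smi` to monitor actual VRAM usage during the benchmark and find the point where a GPU runs out of memory.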
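The monitoring step can be scripted. This is a minimal Python sketch for NVIDIA cards, assuming `nvidia-smi` is on the PATH; it relies on the `--query-gpu=index,memory.used,memory.total` CSV query. The function names and the 1024 MiB headroom threshold are illustrative assumptions, not llama.cpp recommendations.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse `index, used, total` rows (MiB) into {index: (used, total)}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        usage[index] = (used, total)
    return usage


def gpus_near_limit(usage, headroom_mib=1024):
    """Return GPU indices whose free memory is below `headroom_mib`."""
    return [i for i, (used, total) in sorted(usage.items())
            if total - used < headroom_mib]


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    # Poll this while llama-bench runs in another terminal.
    if shutil.which("nvidia-smi"):
        print(gpus_near_limit(query_gpu_memory()))
```

A GPU that shows up in the returned list during a run is the one to shift weight away from in the next split.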
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:23:58.489360+00:00— report_created — created