Report #5268
[tooling] Multi-GPU setup with llama-cpp-python crashes with OOM on the smaller GPU despite total VRAM being sufficient
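Estimate each GPU's share of layers with the formula layers_on_gpu_i = total_layers * (vram_gpu_i / total_vram), then pass the per-GPU proportions to tensor_split (e.g., [0.6, 0.4] for a 60/40 split). The values are proportions, not cumulative split points: llama.cpp normalizes the list, so [3, 2] is equivalent and [0.6, 1.0] would yield roughly a 37.5/62.5 split. Set main_gpu to the device with the most VRAM.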
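A minimal sketch of that calculation, assuming VRAM figures are supplied by hand in GiB and listed in device-index order. The helper name plan_split and the 32-layer model are hypothetical. The layer counts are only an estimate for sanity-checking: llama.cpp derives the actual layer boundaries from tensor_split itself, and this ignores KV cache and scratch-buffer overhead.

```python
def plan_split(total_layers, vram_gib):
    """Return (tensor_split, estimated_layers_per_gpu, main_gpu).

    vram_gib: per-GPU VRAM in GiB, in device-index order (assumed input).
    """
    total_vram = sum(vram_gib)

    # Per-GPU proportions; these are what tensor_split expects.
    tensor_split = [v / total_vram for v in vram_gib]

    # layers_on_gpu_i = total_layers * (vram_gpu_i / total_vram)
    raw = [total_layers * v / total_vram for v in vram_gib]

    # Largest-remainder rounding so the integer counts sum to total_layers.
    layers = [int(r) for r in raw]
    leftover = total_layers - sum(layers)
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - layers[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        layers[i] += 1

    # Index of the GPU with the most VRAM.
    main_gpu = max(range(len(vram_gib)), key=lambda i: vram_gib[i])
    return tensor_split, layers, main_gpu


# Hypothetical 32-layer model on a 24 GiB card (device 0) and a 12 GiB card (device 1).
tensor_split, layers, main_gpu = plan_split(32, [24, 12])
print(tensor_split, layers, main_gpu)
```

In practice it is usually worth shaving the smaller card's share slightly below its raw VRAM ratio to leave headroom for context buffers.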
Journey Context:
Default layer splitting in llama.cpp can place too many layers on the smaller card when GPUs differ (e.g., RTX 4090 24GB + RTX 3060 12GB), which triggers the OOM even though total VRAM is sufficient. Users often try --gpu-layers (n_gpu_layers in llama-cpp-python) to limit the total number of offloaded layers, but this underutilizes the large card. The tensor_split parameter (exposed in llama-cpp-python as a list of floats) controls what proportion of the model goes to each GPU. The values are per-GPU proportions, not cumulative split points; llama.cpp normalizes them and derives the layer boundaries itself. Additionally, setting main_gpu to the larger card keeps small tensors and intermediate results on the faster/larger device, which is intended to avoid cross-GPU bandwidth bottlenecks during attention computation. Note that llama-cpp-python documents main_gpu as applying to the none and row split modes and as ignored in the default layer split mode, where the KV cache is distributed along with the layers. Many users miss this because tutorials focus on single-GPU or homogeneous multi-GPU setups.
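As a rough illustration of how these settings fit together in llama-cpp-python, the sketch below builds the constructor arguments for the 24GB + 12GB example. The helper names and the model path are hypothetical, and the device order is an assumption (index 0 is taken to be the 24GB card; the real order depends on the driver and CUDA_VISIBLE_DEVICES). The Llama import is deferred so the argument-building part can be checked without a GPU.

```python
def build_llama_kwargs(model_path, vram_gib):
    """Build keyword arguments for llama_cpp.Llama from per-GPU VRAM sizes."""
    total_vram = sum(vram_gib)
    return {
        "model_path": model_path,
        # -1 asks llama.cpp to offload every layer instead of capping the count.
        "n_gpu_layers": -1,
        # Per-GPU proportions (not cumulative), in device-index order.
        "tensor_split": [v / total_vram for v in vram_gib],
        # Device with the most VRAM.
        "main_gpu": max(range(len(vram_gib)), key=lambda i: vram_gib[i]),
    }


def load_model(model_path, vram_gib):
    """Load the model; needs llama-cpp-python built with GPU support."""
    from llama_cpp import Llama  # imported lazily on purpose

    return Llama(**build_llama_kwargs(model_path, vram_gib))


# Hypothetical path; 24 GiB card assumed at index 0, 12 GiB card at index 1.
kwargs = build_llama_kwargs("model.gguf", [24, 12])
print(kwargs)
```

If the smaller card still runs out of memory, lowering its entry in tensor_split by a few percentage points is a reasonable next step before reducing n_gpu_layers.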
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:56:40.624203+00:00— report_created — created