Report #12602
[tooling] Multi-GPU llama.cpp with 24GB and 8GB cards splitting layers equally and running out of VRAM on the smaller card
Use `--tensor-split 0.75,0.25` (calculated as `GPU0_VRAM / Total_VRAM`) to distribute layers proportionally to available VRAM. For 24GB+8GB, 0.75/0.25 ensures the 24GB card gets 75% of layers and the 8GB card gets 25%, preventing OOM on the smaller card while maximizing total usable VRAM.
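The ratio calculation can be sketched in a few lines of Python. This is only an illustration of the `GPU_VRAM / Total_VRAM` arithmetic; the helper names `tensor_split` and `tensor_split_flag` are hypothetical and not part of llama.cpp, and the VRAM figures are nominal card sizes rather than measured free memory.

```python
def tensor_split(vram_gb):
    """Return each GPU's share of total VRAM, rounded to two decimals."""
    total = sum(vram_gb)
    return [round(v / total, 2) for v in vram_gb]


def tensor_split_flag(vram_gb):
    """Format the shares as a --tensor-split argument string."""
    shares = tensor_split(vram_gb)
    return "--tensor-split " + ",".join(f"{s:g}" for s in shares)


# 24GB + 8GB pair from this report
print(tensor_split_flag([24, 8]))  # --tensor-split 0.75,0.25
```

In practice it may be worth leaving some headroom on the smaller card, since the KV cache and scratch buffers also consume VRAM; that adjustment is not modeled here.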
Journey Context:
llama.cpp's default `--split-mode layer` divides layers equally across GPUs. A 70B model with 80 layers puts 40 on each GPU. With a 24GB and 8GB pair, the 8GB card OOMs immediately while the 24GB card has 16GB free. Many users incorrectly assume heterogeneous GPUs require `--split-mode row` (tensor parallelism), which needs a high-bandwidth interconnect such as NVLink to avoid slowdowns, or they abandon multi-GPU entirely. The correct approach is `--tensor-split` with fractional values representing the proportion of total VRAM each GPU holds. Because the values are normalized internally, only the ratio matters: `0.75,0.25` (or equivalently `3,1`) maps 60 layers to GPU 0 and 20 to GPU 1 for an 80-layer model. The same calculation works for any VRAM disparity (e.g., 48GB+8GB gives 48/56 ≈ 0.86 and 8/56 ≈ 0.14, so `0.86,0.14`).
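To see how a normalized split turns into per-GPU layer counts, here is a simplified Python sketch. It assumes each layer is assigned to the first GPU whose cumulative share exceeds the layer's relative position; this is an approximation for illustration, not llama.cpp's actual code, and it ignores the output layer and any layers kept on the CPU. The function name `layers_per_gpu` is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def layers_per_gpu(split, n_layers):
    """Count how many layers land on each GPU for a given split."""
    total = sum(split)
    # Normalize so the cumulative shares end at 1.0
    cumulative = [c / total for c in accumulate(split)]
    counts = [0] * len(split)
    for layer in range(n_layers):
        # First GPU whose cumulative share is greater than layer / n_layers
        gpu = bisect_right(cumulative, layer / n_layers)
        gpu = min(gpu, len(split) - 1)  # guard against float edge cases
        counts[gpu] += 1
    return counts


print(layers_per_gpu([0.75, 0.25], 80))  # [60, 20]
print(layers_per_gpu([24, 8], 80))       # [60, 20], same ratio
```

Because only the ratio matters, passing raw VRAM sizes such as `24,8` gives the same layer counts as `0.75,0.25` in this sketch.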
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:22:42.070498+00:00— report_created — created