Report #93934
[tooling] Multi-GPU setup with different VRAM sizes (e.g., RTX 3090 24GB + RTX 4070 Ti 12GB) fails to load 70B model or crashes with OOM on smaller card
Use llama.cpp's `--tensor-split` flag with ratios that match each card's VRAM capacity, for example `--tensor-split 0.66,0.34` for 24GB+12GB, to distribute layers proportionally and keep the smaller GPU from overflowing.
Journey Context:
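With the default split, the 12GB card can end up with more layers than it can hold and hits OOM when loading a 70B Q4 model (needs ~40GB+ total VRAM). Recent llama.cpp builds appear to derive the default split from each device's free memory rather than dividing layers evenly, so check the load log to see what each GPU was actually assigned. Because 24GB+12GB gives only 36GB, a ~40GB model does not fit entirely in VRAM; lower `--n-gpu-layers` (`-ngl`) so the remaining layers stay on the CPU. The `--tensor-split` flag takes a comma-separated list of proportions, one per GPU, giving the share of the offloaded model placed on each card. The values are relative and llama.cpp normalizes them, so they do not have to sum to 1.0: `--tensor-split 24,12` and `--tensor-split 0.67,0.33` describe the same split. Calculate each ratio as `GPU_VRAM / SUM_VRAM`, then reduce the smaller GPU's ratio by 0.02-0.05 to leave room for overhead such as the KV cache and compute buffers. Common mistake: passing unadjusted VRAM proportions (e.g., `--tensor-split 24,12`), which is accepted but leaves no headroom on the smaller card. The alternative is `--split-mode none` together with `--main-gpu` to run on one card only, which wastes the second. A manual split is most useful on heterogeneous rigs, such as mining rigs repurposed for inference.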
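The ratio calculation above can be sketched in Python. This is a minimal illustration and not part of llama.cpp: `tensor_split` and `as_flag` are hypothetical helper names, the default margin of 0.03 is an assumed value picked from the 0.02-0.05 range, and moving that margin onto the largest card (so the values still sum to 1.0) is also an assumption.

```python
def tensor_split(vram_gb, margin=0.03):
    """Return one fraction per GPU, proportional to VRAM.

    `margin` is shifted from the smallest GPU to the largest one
    (assumed headroom for KV cache and buffers on the small card).
    """
    if len(vram_gb) < 2 or any(v <= 0 for v in vram_gb):
        raise ValueError("need at least two GPUs with positive VRAM sizes")

    total = sum(vram_gb)
    ratios = [v / total for v in vram_gb]  # GPU_VRAM / SUM_VRAM

    small = ratios.index(min(ratios))
    large = ratios.index(max(ratios))
    shift = min(margin, ratios[small])     # never push a ratio below zero
    ratios[small] -= shift
    ratios[large] += shift                 # keeps the sum at 1.0
    return ratios


def as_flag(ratios):
    """Format the fractions as a --tensor-split argument string."""
    return "--tensor-split " + ",".join(f"{r:.2f}" for r in ratios)


if __name__ == "__main__":
    # RTX 3090 (24GB) + RTX 4070 Ti (12GB)
    print(as_flag(tensor_split([24, 12])))  # --tensor-split 0.70,0.30
```

The right margin depends on context length and batch size, so treat the printed value as a starting point and adjust it after checking per-GPU memory use during a test load.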
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:15:13.844869+00:00— report_created — created