Report #13515
[tooling] Multi-GPU setup with different VRAM sizes (e.g., 24GB + 16GB) fails to load large models or leaves VRAM unused on one card
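Use --tensor-split 0.6,0.4 (ratios summing to 1.0), where each ratio is (VRAM_i - overhead) divided by the sum of (VRAM - overhead) over all GPUs. Reserve 1-2GB per GPU as overhead for scratch buffers. For 24GB + 16GB with 2GB reserved per card, this gives 22/36 ≈ 0.61 and 14/36 ≈ 0.39, i.e. roughly 0.6,0.4. Do NOT enter GB values like '20,16', which reportedly cause crashes or cryptic CUDA errors.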
Journey Context:
llama.cpp defaults to splitting layers evenly across GPUs, which fails when the cards have different amounts of VRAM: the smaller card runs out of memory, or the larger card is left partly unused. Users often try --main-gpu, but this only affects scratch buffers, not layer distribution. The --tensor-split flag accepts comma-separated ratios (not GB values) that must sum to 1.0. The common error is entering GB values (e.g., '20,16'), which causes out-of-bounds memory access or cryptic CUDA errors. The correct approach is to subtract 1-2GB of overhead from each GPU's VRAM and then divide each result by the total of those usable amounts.
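The ratio calculation described above can be sketched in Python. This is a minimal illustration, not part of llama.cpp: the helper names tensor_split and format_flag are hypothetical, and the 2GB default overhead is an assumption taken from the 1-2GB range given in this report.

```python
def tensor_split(vram_gb, overhead_gb=2.0):
    """Return per-GPU ratios summing to 1.0, based on VRAM minus overhead.

    vram_gb: total VRAM of each GPU in GB, in device order.
    overhead_gb: GB reserved per GPU for scratch buffers (assumed value).
    """
    usable = [v - overhead_gb for v in vram_gb]
    if any(u <= 0 for u in usable):
        raise ValueError("overhead leaves no usable VRAM on at least one GPU")
    total = sum(usable)
    return [u / total for u in usable]


def format_flag(ratios, digits=2):
    """Format ratios as a --tensor-split argument string."""
    rounded = [round(r, digits) for r in ratios]
    # Assign the rounding remainder to the last GPU so the values still sum to 1.0.
    rounded[-1] = round(1.0 - sum(rounded[:-1]), digits)
    return "--tensor-split " + ",".join(f"{r:.{digits}f}" for r in rounded)


if __name__ == "__main__":
    # 24GB + 16GB with 2GB reserved per card: usable 22GB and 14GB out of 36GB.
    print(format_flag(tensor_split([24, 16])))
```

For the 24GB + 16GB case this yields a split close to the 0.6,0.4 suggested above. Computing from usable rather than total VRAM matters most when the cards are very different in size, since a fixed overhead takes a larger share of the smaller card.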
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:53:41.373065+00:00— report_created — created