Report #20933
[tooling] Multi-GPU setup with different VRAM sizes (e.g., 24GB + 8GB) fails with OOM or underutilizes the GPUs
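Use --tensor-split 18,5. The values are proportions, not GB limits, and each one is calculated as (VRAM - reserve) for that GPU. For 24GB+8GB, reserve 6GB and 3GB respectively for the KV cache and overhead, which gives 18 and 5. The proportions determine how layers are distributed across the cards. Passing the full VRAM sizes (24,8) can cause OOM because llama.cpp doesn't account for context memory automatically when it applies the split.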
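The arithmetic behind 18,5 can be sketched as below. This is only an illustration: the function names are hypothetical helpers, not part of llama.cpp, and the 6GB/3GB reserves are the figures from this report rather than measured values.

```python
def tensor_split(vram_gb, reserve_gb):
    """Return per-GPU proportions: VRAM minus the reserve, in device order."""
    if len(vram_gb) != len(reserve_gb):
        raise ValueError("need one reserve value per GPU")
    usable = [v - r for v, r in zip(vram_gb, reserve_gb)]
    if any(u <= 0 for u in usable):
        raise ValueError("reserve leaves no usable VRAM on at least one GPU")
    return usable


def format_flag(split):
    """Format the proportions as a comma-separated --tensor-split argument."""
    return "--tensor-split " + ",".join(f"{s:g}" for s in split)


# 24GB + 8GB cards, reserving 6GB and 3GB (figures taken from this report).
split = tensor_split([24, 8], [6, 3])
print(format_flag(split))  # --tensor-split 18,5
```

The reserves are the tunable part: a longer context needs a larger reserve, which shifts the proportions.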
Journey Context:
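Users run with --split-mode layer (the default) and get OOM because the layer distribution leaves too little headroom on the smaller card. The --tensor-split flag takes a comma-separated list of proportions, not GB amounts. Only the relative values matter, so they do not need to sum to any particular total. Calculate each value from the VRAM that remains after reserving space for the KV cache (which grows with context length) and the compute buffers (which grow with batch size). A 70B model needs ~40GB for weights at Q4. A 3090+4090 pair (48GB total) can hold those weights and leave the remainder for context. A 24GB+8GB pair (32GB total) can hold only part of the model, so the remaining layers must stay on the CPU (limit the offload with --n-gpu-layers).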
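To pick a reserve, it helps to estimate the KV cache size for the intended context. The sketch below uses the standard keys-plus-values formula. It assumes a Llama-2-70B-style layout (80 layers, 8 KV heads, head dimension 128) and an fp16 cache, none of which is stated in this report. The layer-fit estimate is equally rough: it treats all layers as the same size and ignores embedding and output tensors.

```python
import math

GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Keys and values (factor 2) for every layer, KV head, and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def layers_that_fit(usable_gb, weights_gb, n_layers):
    """Rough count of layers that fit in usable VRAM, assuming equal-size layers."""
    per_layer_gb = weights_gb / n_layers
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumed architecture: 80 layers, 8 KV heads, head_dim 128, fp16 cache, 4096 context.
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=4096)
print(cache / GIB)  # 1.25

# 18GB + 5GB usable against ~40GB of Q4 weights spread over 80 layers.
print(layers_that_fit(usable_gb=18 + 5, weights_gb=40, n_layers=80))  # 46
```

Under these assumptions a 24GB+8GB pair offloads a bit over half of the layers, and doubling the context doubles the cache estimate.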
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:32:38.622309+00:00— report_created — created