Report #74289
[tooling] Multi-GPU inference fails on asymmetric VRAM (e.g., 24 GB + 12 GB GPUs)
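Use --tensor-split 24,12 to distribute the model across GPUs in proportion to their VRAM instead of relying on the default split, which prevents OOM on the smaller GPU. llama.cpp documents the values as a comma-separated list of proportions, so 24,12 behaves like 2,1, and VRAM figures in GB or MB work as long as every entry uses the same unit.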
Journey Context:
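With the default multi-GPU split described in this report, llama.cpp distributes layers equally, so the smallest GPU runs out of memory when its share of the model exceeds its VRAM. A manual tensor split sets each GPU's share of the layers in proportion to its available VRAM. The format is a comma-separated list with one value per GPU, in device order; the values are interpreted as proportions, so fractions or VRAM sizes in MB both work when every entry uses the same unit. This matters when mixing GPU generations (e.g., RTX 4090 + 3090), pairing a laptop dGPU with an eGPU, or running on cloud spot instances with different GPU types.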
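As a rough illustration of the proportional split, the sketch below derives a --tensor-split value from per-GPU VRAM sizes. The helper name tensor_split_arg, the reserve_mb headroom parameter, and the model.gguf path are hypothetical and not part of llama.cpp; only the -m, -ngl, and --tensor-split flags come from the llama.cpp CLI. The headroom default is an assumption, not a measured figure.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(vram_mb, reserve_mb=1024):
    """Build a comma-separated proportion string for --tensor-split.

    vram_mb:    per-GPU VRAM in MB, listed in device order.
    reserve_mb: hypothetical headroom kept free on each GPU (assumed value).
    """
    usable = [max(v - reserve_mb, 0) for v in vram_mb]
    if not any(usable):
        raise ValueError("no usable VRAM left after applying the reserve")
    # Reduce to the smallest whole-number ratio; only the proportions matter.
    divisor = reduce(gcd, usable)
    return ",".join(str(u // divisor) for u in usable)


def llama_cli_args(model_path, vram_mb, reserve_mb=1024):
    """Assemble an argument list for llama-cli (model_path is a placeholder)."""
    return [
        "llama-cli",
        "-m", model_path,
        "-ngl", "99",  # request offload of all layers to the GPUs
        "--tensor-split", tensor_split_arg(vram_mb, reserve_mb),
    ]


if __name__ == "__main__":
    # 24 GB + 12 GB GPUs, as in the report title.
    print(" ".join(llama_cli_args("model.gguf", [24576, 12288])))
```

Reducing the values to a small ratio is only cosmetic, since the proportions are what count. Check the resulting command against your llama.cpp build before running it.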
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:17:38.242051+00:00— report_created — created