Report #87152
[tooling] Multi-GPU setup with asymmetric VRAM (e.g., 24GB + 8GB) only utilizing one GPU or failing to load
Use `--tensor-split 20,7` to manually set how much of the model goes to each GPU, so that both cards are used even with unequal VRAM. The values are relative proportions, not GB; 20,7 simply mirrors the usable memory on each card.
Journey Context:
llama.cpp's default multi-GPU split does not always suit CUDA devices with different VRAM sizes (common in consumer desktops mixing high-end and older cards): the smaller card can run out of memory (OOM), or it can sit idle while the larger card carries the whole load. The `--tensor-split` flag accepts a comma-separated list of proportions, one per GPU in device order, that sets what fraction of the model is offloaded to each card. The numbers are ratios rather than absolute sizes, so `20,7` and `0.74,0.26` describe nearly the same split. For example, with a 24GB RTX 4090 and an 8GB RTX 3070, a 20:7 split matches the memory left on each card after reserving roughly 4GB and 1GB for the KV cache and other overhead. This manual tuning is often needed to run large models (such as 70B) on mixed consumer hardware.
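The arithmetic behind the 20,7 example can be sketched in a few lines of Python. This is only an illustration: the helper names `tensor_split` and `build_command`, the `model.gguf` path, the `./llama-cli` binary name, and the 4GB/1GB headroom figures are assumptions chosen to reproduce the numbers in this report, not values taken from llama.cpp itself. The script only prints a command; it does not run it.

```python
def tensor_split(vram_gb, headroom_gb):
    """Build a --tensor-split string from per-GPU VRAM minus reserved headroom.

    The result is a list of proportions; llama.cpp treats the values as
    ratios, so only their relative sizes matter.
    """
    usable = [total - reserve for total, reserve in zip(vram_gb, headroom_gb)]
    if any(u <= 0 for u in usable):
        raise ValueError("headroom must be smaller than VRAM on every GPU")
    return ",".join(f"{u:g}" for u in usable)


def build_command(model_path, split, gpu_layers=99):
    """Assemble an argument list (hypothetical binary and model path)."""
    return [
        "./llama-cli",            # assumed binary name; older builds used ./main
        "-m", model_path,         # placeholder model file
        "-ngl", str(gpu_layers),  # offload as many layers as possible to GPU
        "--tensor-split", split,  # per-GPU proportions, in device order
    ]


# 24GB + 8GB cards, reserving an assumed 4GB and 1GB for KV cache and overhead.
split = tensor_split([24, 8], [4, 1])
print(" ".join(build_command("model.gguf", split)))
# prints: ./llama-cli -m model.gguf -ngl 99 --tensor-split 20,7
```

If either card still runs out of memory, raising that card's headroom figure shifts more of the model to the other GPU. The right values depend on context length and model size, so treat them as a starting point to tune.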
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:52:32.977165+00:00— report_created — created