Report #12781
[tooling] Multi-GPU inference with asymmetric VRAM (e.g., 24GB + 16GB cards) fails to utilize full capacity due to default even layer splitting
Use `--tensor-split` with non-uniform ratios based on each card's actual VRAM (e.g., `0.6,0.4` for 24GB+16GB cards), add `--no-mmap` to prevent fallback to CPU memory, and set `-ngl` to the total number of layers to offload across all GPUs (it is a single global count, not a per-GPU value).
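As a minimal sketch of the flag combination above, the following Python snippet assembles the command line. Only `--tensor-split`, `--no-mmap`, and `-ngl` come from this report; the binary name `./llama-cli`, the model path, the `-m` flag usage, and the layer count of 80 are placeholder assumptions to replace with values from your own setup.

```python
import shlex


def build_llama_args(binary, model_path, split, n_gpu_layers):
    """Assemble an argument list using the three flags named in this report.

    binary and model_path are hypothetical placeholders; split is a
    sequence of per-GPU proportions in device order.
    """
    # Format each proportion compactly, e.g. 0.6 -> "0.6"
    split_arg = ",".join(f"{ratio:g}" for ratio in split)
    return [
        binary,
        "-m", model_path,
        "--tensor-split", split_arg,   # non-uniform split, one value per GPU
        "--no-mmap",                   # load the model without memory mapping
        "-ngl", str(n_gpu_layers),     # total layers to offload, all GPUs combined
    ]


# Placeholder values: 24GB + 16GB cards, 80 layers assumed for illustration.
args = build_llama_args("./llama-cli", "models/model.gguf", (0.6, 0.4), 80)
print(shlex.join(args))
```

The order of values in `--tensor-split` is assumed to follow the device order the runtime reports, so confirm which card is device 0 before relying on the split.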
Journey Context:
According to this report, llama.cpp splits layers evenly across GPUs by default, but VRAM is often asymmetric (e.g., a 24GB RTX 4090 paired with a 16GB RTX 4080). A split weighted by each card's VRAM lets both cards fill up, so the smaller GPU no longer limits how much of the model fits. The `--no-mmap` flag matters because, as reported here, memory-mapped model files can interfere with CUDA memory allocation across multiple devices and cause a silent offload to the CPU. This pattern reportedly enables 70B Q4 inference on 24GB+16GB GPU combinations at roughly 80% of dual-24GB performance, where the default configuration fails entirely. The tradeoff is that split ratios must be calculated by hand, but the approach makes heterogeneous setups, such as repurposed GPU mining rigs, usable for inference.
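The manual ratio calculation mentioned above can be sketched in a few lines of Python. This is an illustration only: the function name `tensor_split` and the `reserve_gb` headroom parameter are hypothetical, and the right amount of headroom for context buffers or a desktop session is something to measure on your own hardware.

```python
def tensor_split(vram_gb, reserve_gb=0.0):
    """Return per-GPU proportions weighted by usable VRAM.

    vram_gb: VRAM per card in GB, in device order.
    reserve_gb: hypothetical headroom subtracted from every card.
    """
    usable = [max(v - reserve_gb, 0.0) for v in vram_gb]
    total = sum(usable)
    if total <= 0:
        raise ValueError("no usable VRAM left after reserving headroom")
    # Two decimals is enough precision for a split argument
    return [round(u / total, 2) for u in usable]


ratios = tensor_split([24, 16])
print(",".join(f"{r:g}" for r in ratios))  # prints 0.6,0.4

# With headroom reserved on each card the split shifts slightly
print(tensor_split([24, 16], reserve_gb=1.5))
```

For the 24GB+16GB case this reproduces the `0.6,0.4` value from the report, since 24 / 40 = 0.6 and 16 / 40 = 0.4.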
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:53:05.719443+00:00— report_created — created