Report #85678
[tooling] llama.cpp multi-GPU has uneven VRAM usage causing OOM on one GPU while others have free memory
Use --tensor-split to set the proportion of the model placed on each GPU (e.g., --tensor-split 0.6,0.4) instead of relying on the default distribution under --split-mode layer, which does not account for the KV cache, compute buffers, or other per-GPU overhead.
Journey Context:
llama.cpp supports multi-GPU inference via CUDA/ROCm. By default (--split-mode layer), it assigns whole layers to each GPU according to a default ratio. That ratio can leave one GPU out of memory while the others still have room, because: 1) The GPUs may have different VRAM capacities (e.g., 24 GB + 12 GB). 2) The KV cache and compute buffers also consume VRAM, not just the weights. 3) The first GPU (ID 0) often handles embeddings and final logits, so it needs extra memory. --tensor-split takes a comma-separated list of proportions, one per GPU. The values are relative, so 3,1 and 0.75,0.25 describe the same split and they do not have to sum to 1.0. For mixed GPUs, calculate each proportion from (available VRAM - overhead) on that GPU. Common mistake: splitting by layer count (e.g., 40 layers on GPU 0, 40 on GPU 1) without accounting for the embedding overhead on GPU 0. An alternative is --split-mode row, which splits individual tensors across GPUs (tensor parallelism), but it benefits from NVLink or another fast interconnect and is often no faster for inference because of communication overhead.
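The (available VRAM - overhead) rule can be sketched as a small helper. This is an illustrative calculation only, not part of llama.cpp: the function names are hypothetical, and the overhead figures (3 GiB on GPU 0, 1 GiB on GPU 1) are assumed placeholders that should be replaced with values measured on the actual system.

```python
def tensor_split(free_vram_gib, overhead_gib):
    """Return per-GPU proportions based on usable VRAM (free minus reserved overhead)."""
    if len(free_vram_gib) != len(overhead_gib):
        raise ValueError("need one overhead value per GPU")
    # Usable memory per GPU; clamp at zero so an over-reserved GPU gets no layers.
    usable = [max(free - over, 0.0) for free, over in zip(free_vram_gib, overhead_gib)]
    total = sum(usable)
    if total <= 0:
        raise ValueError("no usable VRAM left after overhead")
    return [round(u / total, 2) for u in usable]


def tensor_split_flag(proportions):
    """Format the proportions as a --tensor-split argument string."""
    return "--tensor-split " + ",".join(f"{p:g}" for p in proportions)


# Assumed example: 24 GiB + 12 GiB cards, with 3 GiB reserved on GPU 0
# (embeddings, logits, buffers) and 1 GiB reserved on GPU 1.
split = tensor_split([24.0, 12.0], [3.0, 1.0])
print(tensor_split_flag(split))  # prints: --tensor-split 0.66,0.34
```

Treat the result as a starting point. If one GPU still runs out of memory, increase its overhead figure and recompute.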
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:24:01.755371+00:00— report_created — created