Report #76436
[tooling] llama.cpp fails to load 70B+ models on multi-GPU setups with different VRAM sizes, or crashes with automatic GPU layer splitting
Use `--tensor-split` with comma-separated ratios, one per GPU (e.g. `--tensor-split 0.7,0.3`), to set the layer distribution across GPUs manually instead of relying on the automatic split. Base the ratios on each card's VRAM minus its overhead rather than on raw capacity: for 24GB + 12GB cards the raw capacity ratio is about 0.67,0.33, so `--tensor-split 0.65,0.35` leaves extra headroom on the 24GB card for context KV cache and activation memory.
Journey Context:
llama.cpp enumerates GPUs in CUDA_VISIBLE_DEVICES order and, by default, splits layers automatically based on reported free VRAM. On heterogeneous GPU setups (e.g., RTX 4090 24GB + RTX 3080 12GB) this heuristic can fail because it does not account for: (1) VRAM already allocated by display drivers/framebuffer, (2) KV cache memory, which scales with context length and is allocated per layer, (3) activation (compute buffer) memory needed during inference. The result is a failed model load or an OOM crash during inference. The `--tensor-split` flag accepts comma-separated proportions, one per GPU in device order, representing the fraction of layers assigned to each GPU. llama.cpp normalizes the values, so they do not strictly have to sum to 1.0, although writing them that way keeps the split easy to read. Calculate each GPU's share as: split_ratio = (GPU_VRAM - overhead) / sum(All_GPU_VRAM - overhead), where overhead is typically 2-4GB for display/OS plus the estimated KV cache on that GPU. With a split that fits, 70B models can run on mixed consumer GPU setups.
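The split_ratio formula can be sketched in a few lines of Python. This is a minimal illustration, not part of llama.cpp: the helper names `tensor_split` and `tensor_split_flag` are hypothetical, and the 3GB overhead per card is an assumed estimate that you should replace with figures measured on your own system.

```python
def tensor_split(vram_gb, overhead_gb):
    """Return per-GPU split ratios from total VRAM and estimated overhead (GB).

    Implements split_ratio = (GPU_VRAM - overhead) / sum(All_GPU_VRAM - overhead).
    Lists must be in the same order as the devices llama.cpp sees.
    """
    if len(vram_gb) != len(overhead_gb):
        raise ValueError("need one overhead estimate per GPU")
    usable = [v - o for v, o in zip(vram_gb, overhead_gb)]
    if any(u <= 0 for u in usable):
        raise ValueError("overhead meets or exceeds VRAM on at least one GPU")
    total = sum(usable)
    return [u / total for u in usable]


def tensor_split_flag(ratios):
    """Format ratios as a --tensor-split argument with two decimals."""
    return "--tensor-split " + ",".join(f"{r:.2f}" for r in ratios)


# Assumed estimates: 24GB and 12GB cards, 3GB overhead on each.
ratios = tensor_split([24, 12], [3, 3])
print(tensor_split_flag(ratios))  # --tensor-split 0.70,0.30
```

Different overhead estimates shift the result: a card that also drives a display or holds more of the KV cache should be given a larger overhead, which lowers its share.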
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:53:23.254644+00:00— report_created — created