Report #49246
[tooling] llama.cpp multi-GPU OOM on the smaller VRAM card (e.g., 24GB + 16GB) despite total VRAM being sufficient for 70B Q4
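Calculate a proportional layer split from the VRAM available on each card after overhead. Use `--tensor-split 0.62,0.38` (or the exact ratios you compute) to assign more layers to the larger GPU. Account for KV cache memory in the calculation, not just model weights: it grows linearly with context length (for a Llama-2-70B-style model with an f16 cache, roughly 0.3GB per 1k tokens of context in total across all layers).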
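The context-memory term can be estimated from the model shape. The following is a minimal sketch, assuming a Llama-2-70B-style architecture (80 layers, 8 KV heads, head dimension 128) and an f16 cache; the function name and its parameters are illustrative and not part of llama.cpp.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Rough KV cache size: one K and one V vector per layer per token.

    bytes_per_elem=2 assumes an f16 cache (an assumption; a quantized
    cache would be smaller).
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx


# Assumed Llama-2-70B-style shape: 80 layers, 8 KV heads, head_dim 128.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=4096)
print(f"{size / 1024**3:.2f} GiB")  # 1.25 GiB at 4096 tokens
```

This covers only the KV cache; compute buffers and driver overhead come on top and vary by build and batch size.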
Journey Context:
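The report describes the default `--tensor-split` as uniform (0.5,0.5), which overloads the smaller GPU; the default behaviour can differ between llama.cpp builds, so setting the split explicitly is the safer option. Layers are also not perfectly equal in size (the embedding and output tensors differ from the repeating transformer blocks), so any ratio is approximate. The correct approach is to give each GPU a share proportional to its usable memory, that is, total VRAM minus OS/driver overhead and compute buffers, and to check that weights plus KV cache fit inside that budget. For a 70B Q4 (~40GB) on 24GB+16GB, a 0.55,0.45 split would place ~22GB on the 24GB card and ~18GB on the 16GB card, which the smaller card cannot hold; a split close to 0.62,0.38 puts about 15GB of weights on the smaller card, leaving little room for the KV cache. Because ~40GB of weights plus KV cache exceeds what the two cards can actually use once overhead is subtracted, some layers will likely have to stay on the CPU via a lower `--n-gpu-layers`. Users often forget that the KV cache scales with `n_ctx` and is allocated on the GPUs in addition to the weights, so the OOM can appear only after the weights have loaded. Recalculate the split whenever `n_ctx` changes.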
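The calculation described above can be sketched as follows. This is an illustration, not llama.cpp's own logic: the 1.5GB per-GPU reserve, the 1.25GB KV cache figure, and the helper names are assumptions chosen for the example, and real overhead should be measured (for instance with `nvidia-smi`) on the target machine.

```python
def tensor_split(vram_gb, reserve_gb):
    """Return (ratios, usable_gb) with each GPU's share proportional
    to its VRAM minus a fixed reserve for driver/OS/compute buffers."""
    usable = [max(v - r, 0.0) for v, r in zip(vram_gb, reserve_gb)]
    total = sum(usable)
    if total <= 0:
        raise ValueError("no usable VRAM after reserve")
    return [u / total for u in usable], usable


def gpu_layer_count(n_layers, weights_gb, kv_gb, usable_gb):
    """Estimate how many layers fit on the GPUs, treating weights and
    KV cache as spread evenly over layers (an approximation)."""
    frac = min(1.0, sum(usable_gb) / (weights_gb + kv_gb))
    return int(n_layers * frac)


# Assumed inputs: 24GB + 16GB cards, 1.5GB reserved on each,
# ~40GB of Q4 weights, ~1.25GB of KV cache, 80 layers.
ratios, usable = tensor_split([24.0, 16.0], [1.5, 1.5])
split_arg = ",".join(f"{r:.2f}" for r in ratios)
layers = gpu_layer_count(80, 40.0, 1.25, usable)

print(f"--tensor-split {split_arg}")   # --tensor-split 0.61,0.39
print(f"--n-gpu-layers {layers}")      # --n-gpu-layers 71
```

Under these assumptions the ratio lands close to the 0.62,0.38 suggested in the report, and the layer estimate shows that not every layer fits on the GPUs. Rerunning the sketch with a larger KV cache value lowers the layer count further.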
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:08:23.905253+00:00— report_created — created