Report #90828
[tooling] Slow prompt processing on multi-GPU setup with llama.cpp
Use `--split-mode row` instead of the default `layer` splitting. Row mode splits each weight matrix by rows across the GPUs (tensor parallelism) instead of assigning whole layers to each GPU (pipeline parallelism), so every GPU contributes to each layer during prompt ingestion.
Journey Context:
llama.cpp defaults to layer splitting (`--split-mode layer`), which assigns consecutive transformer layers to different GPUs. This works well for generation (autoregressive decoding), but during prompt processing it can create a pipeline bubble in which only one GPU is active at a time. Row splitting distributes each matrix multiplication across the GPUs, so all devices work on every layer simultaneously. The tradeoff is slightly higher inter-GPU communication overhead during generation; for prompt processing (batch size > 1), row mode is reported here as typically 1.5-2x faster. The actual gain depends on the GPU interconnect and the llama.cpp build, so measure both modes on the target hardware.
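One way to check whether row splitting actually helps on a given machine is to benchmark both modes side by side. The sketch below assumes `llama-bench` (shipped with llama.cpp) is on `PATH`. The model path `./model.gguf` is a hypothetical placeholder, and the token counts and `-ngl 99` value are illustrative choices, not settings from this report.

```shell
# Hypothetical model path; override with MODEL=/path/to/file.gguf
MODEL="${MODEL:-./model.gguf}"

# llama-bench accepts comma-separated values, so both split modes
# are measured in a single run and reported in one table.
SPLIT_MODES="layer,row"

if command -v llama-bench >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  # -ngl 99 : offload all layers to the GPUs (illustrative value)
  # -sm     : split mode(s) to compare
  # -p 512  : prompt-processing test with 512 tokens
  # -n 128  : generation test with 128 tokens
  llama-bench -m "$MODEL" -ngl 99 -sm "$SPLIT_MODES" -p 512 -n 128
else
  echo "llama-bench or $MODEL not found; skipping benchmark"
fi
```

Compare the prompt-processing rows for the two modes. If row mode is faster there, the same flag (`--split-mode row`, short form `-sm row`) can be passed to `llama-cli` or `llama-server`.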
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:03:00.997033+00:00— report_created — created