Report #21495
[tooling] Adding more GPU layers with -ngl slows down llama.cpp instead of speeding it up
Run `llama-bench -ngl 0,10,20,... -o json` to sweep the offload count and find the saturation point where PCIe bandwidth becomes the bottleneck. Use the last layer count that still improves throughput; past that point tok/s plateaus or drops.
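A minimal Python sketch of driving that sweep from a script. The helper names `build_sweep_command` and `run_sweep` are hypothetical, the binary is assumed to be on `PATH` as `llama-bench`, and the script assumes that `-o json` writes a single JSON array to stdout; check this against your build before relying on it.

```python
import json
import subprocess


def build_sweep_command(model_path, ngl_values, bench_binary="llama-bench"):
    """Return the argv list for one llama-bench run covering every -ngl value.

    Hypothetical helper: llama-bench accepts a comma-separated list for -ngl,
    so the whole sweep fits in a single invocation.
    """
    ngl_arg = ",".join(str(n) for n in ngl_values)
    return [bench_binary, "-m", model_path, "-ngl", ngl_arg, "-o", "json"]


def run_sweep(model_path, ngl_values, bench_binary="llama-bench"):
    """Run the sweep and return the parsed results.

    Assumption: the JSON report is the only thing written to stdout.
    """
    cmd = build_sweep_command(model_path, ngl_values, bench_binary)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


# Example call (model path is a placeholder):
# results = run_sweep("model.gguf", range(0, 41, 10))
```

Building the argument list separately from running it makes the command easy to print and inspect before it touches the GPU.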
Journey Context:
llama.cpp offloads transformer layers to the GPU for speed, but with partial offload every generated token still has to move activations across PCIe between the CPU-resident and GPU-resident parts of the model. If GPU compute is fast (e.g., A100) but the PCIe link is narrow (e.g., x4 lanes), transfer time can dominate. Naively setting `-ngl 999` can then saturate the bus and cause CPU-GPU synchronization stalls. The more reliable workflow is to benchmark a sweep of `-ngl` values. The output shows tok/s rising, then flattening (or dropping). The 'elbow' of that curve is your optimal value; beyond it, extra layers consume VRAM without adding throughput and put more pressure on the bus. This matters most on multi-GPU setups, where NUMA and PCIe topology affect the result.
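One way to pick the elbow from the sweep output, as a hedged sketch rather than a definitive method. The function `find_elbow` and its 5% `min_gain` threshold are illustrative choices, and the field names `n_gpu_layers`, `n_gen`, and `avg_ts` are assumed to match the JSON your llama-bench build emits. The sample rows are made-up numbers for demonstration only, not measurements.

```python
def find_elbow(results, min_gain=0.05):
    """Return (ngl, tok_per_s) for the last -ngl value that still paid off.

    Walks the sweep in ascending -ngl order and stops at the first step whose
    throughput is less than (1 + min_gain) times the best value so far.
    Only token-generation rows (n_gen > 0) are considered.
    """
    points = sorted(
        (row["n_gpu_layers"], row["avg_ts"])
        for row in results
        if row.get("n_gen", 0) > 0
    )
    if not points:
        raise ValueError("no token-generation rows in results")

    best_ngl, best_ts = points[0]
    for ngl, ts in points[1:]:
        if ts < best_ts * (1 + min_gain):
            break  # gain too small (or a regression): the curve has flattened
        best_ngl, best_ts = ngl, ts
    return best_ngl, best_ts


# Synthetic rows shaped like the assumed llama-bench JSON output.
sample = [
    {"n_gpu_layers": 0, "n_prompt": 512, "n_gen": 0, "avg_ts": 300.0},  # prompt row, ignored
    {"n_gpu_layers": 0, "n_prompt": 0, "n_gen": 128, "avg_ts": 10.0},
    {"n_gpu_layers": 10, "n_prompt": 0, "n_gen": 128, "avg_ts": 15.0},
    {"n_gpu_layers": 20, "n_prompt": 0, "n_gen": 128, "avg_ts": 21.0},
    {"n_gpu_layers": 30, "n_prompt": 0, "n_gen": 128, "avg_ts": 21.5},
    {"n_gpu_layers": 40, "n_prompt": 0, "n_gen": 128, "avg_ts": 19.0},
]
elbow_ngl, elbow_ts = find_elbow(sample)
print(f"suggested -ngl {elbow_ngl} at {elbow_ts} tok/s")
```

A relative threshold is used instead of a simple maximum so that run-to-run noise on the plateau does not push the choice toward a higher, VRAM-hungry value. Tune `min_gain` to the variance you see between repeated runs.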
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:29:42.888728+00:00— report_created — created