Report #94922
[tooling] Low GPU utilization and high CPU overhead in interactive chat with small batch sizes
Enable CUDA graph capture with the `--graph` flag. This eliminates per-layer CPU kernel launch overhead by replaying a static compute graph.
Journey Context:
Without graphs, llama.cpp launches the CUDA kernels for each layer individually through the driver. At batch size 1 this makes the CPU the bottleneck and leaves the GPU idle between layers. `--graph` captures the forward pass as a static graph on the first run and replays it with a single launch, improving throughput by 10–30% for interactive use. Note that dynamic shapes (varying sequence lengths) invalidate the captured graph and force a re-capture, so pair this with `--parallel` for multi-user scenarios, or use fixed batching.
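To make the trade-off concrete, here is a toy cost model in Python. It is not llama.cpp code and calls no CUDA API: the class `GraphRunner`, the helper functions, and the figure of 320 kernels per forward pass are all hypothetical. It only counts CPU-side calls under two assumptions: capturing a graph costs about as much as one eager pass, and a replay costs one call.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

Shape = Tuple[int, ...]


@dataclass
class GraphRunner:
    """Toy model: counts CPU-side calls when replaying a captured graph."""
    kernels_per_pass: int          # hypothetical kernel count per forward pass
    cpu_calls: int = 0             # total CPU-side calls issued so far
    captures: int = 0              # how many times the graph was (re)captured
    _shape: Optional[Shape] = None # shape the current graph was captured for

    def step(self, shape: Shape) -> None:
        if shape != self._shape:
            # Shape changed: the old graph is invalid, so record every kernel again.
            self.cpu_calls += self.kernels_per_pass
            self.captures += 1
            self._shape = shape
        # Replaying the captured graph is modeled as a single call.
        self.cpu_calls += 1


def eager_cpu_calls(kernels_per_pass: int, steps: int) -> int:
    """Without graphs, every kernel is launched individually on every step."""
    return kernels_per_pass * steps


def graph_cpu_calls(kernels_per_pass: int, shapes: Iterable[Shape]) -> Tuple[int, int]:
    """Return (cpu_calls, captures) for a sequence of per-step input shapes."""
    runner = GraphRunner(kernels_per_pass)
    for shape in shapes:
        runner.step(shape)
    return runner.cpu_calls, runner.captures


if __name__ == "__main__":
    kernels = 32 * 10  # assumed: 32 layers x 10 kernels each
    fixed = [(1, 1)] * 100                      # same shape every step
    varying = [(1, n) for n in range(1, 101)]   # new shape every step

    print("eager:        ", eager_cpu_calls(kernels, 100))
    print("graph, fixed: ", graph_cpu_calls(kernels, fixed))
    print("graph, varied:", graph_cpu_calls(kernels, varying))
```

Under these assumptions, 100 fixed-shape steps cost 420 calls with a graph against 32000 without. When the shape changes on every step, the graph is re-captured each time and costs 32100 calls, slightly worse than eager. Real capture and replay costs will differ, so treat the numbers as illustrative only.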
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:54:26.043751+00:00— report_created — created