Report #74287
[tooling] High latency overhead in small-batch llama.cpp CUDA inference
Enable CUDA graphs with --cuda-graph 1 to capture the kernel sequence once and replay it, which removes most of the per-kernel CPU launch overhead. Reported speedup is 15-30% at batch size 1.
Journey Context:
At small batch sizes each kernel does little work, so the CPU-side cost of launching kernels one at a time becomes a significant share of per-token latency. CUDA graphs capture the sequence of kernel launches once and replay it as a single launch, at the cost of higher VRAM usage and a longer warmup. The benefit is limited to small batches (1-4); for large batches, kernel execution time dominates, so disable graphs there. Many users miss this option because it depends on specific compilation flags and runtime parameters.
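To see why the gain shrinks as batch size grows, here is a rough back-of-envelope model in Python. It is only a sketch: the function names, the kernel count, and the microsecond figures are hypothetical inputs chosen for illustration, not measurements from llama.cpp. It also assumes launch overhead is fully serialized with kernel execution, which overstates the cost on real hardware, where launches partly overlap with GPU work.

```python
def step_time_us(n_kernels, kernel_us, launch_us, use_graph):
    """Estimate the time of one decode step in microseconds (toy model).

    n_kernels  -- kernels launched per step (hypothetical value)
    kernel_us  -- average GPU execution time per kernel
    launch_us  -- CPU-side cost of one launch
    use_graph  -- if True, the whole step is replayed with a single launch
    """
    gpu_work = n_kernels * kernel_us
    if use_graph:
        # One graph launch replaces n_kernels individual launches.
        return gpu_work + launch_us
    # Assumption: every launch cost is paid serially, with no overlap.
    return gpu_work + n_kernels * launch_us


def speedup(n_kernels, kernel_us, launch_us):
    """Ratio of eager step time to graph-replay step time."""
    eager = step_time_us(n_kernels, kernel_us, launch_us, use_graph=False)
    graphed = step_time_us(n_kernels, kernel_us, launch_us, use_graph=True)
    return eager / graphed


if __name__ == "__main__":
    # Illustrative inputs only: short kernels stand in for batch size 1,
    # long kernels stand in for a large batch.
    print("short kernels:", round(speedup(1000, 20.0, 5.0), 3))
    print("long kernels: ", round(speedup(1000, 400.0, 5.0), 3))
```

With these made-up inputs the model gives roughly 1.25x for short kernels and roughly 1.01x for long ones. The launch cost is constant while the kernel time grows with batch size, so the fraction of the step that graphs can remove falls towards zero.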
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:17:34.393473+00:00— report_created — created