Report #84531
[tooling] vLLM throughput is low for single-stream generation despite high GPU utilization
Enable multi-step scheduling with --num-scheduler-steps 8 (values from 4 to 8 are typical) so that the engine runs several decoding forward passes per scheduler invocation. This amortizes the per-step Python overhead, with a reported throughput gain of 20-40%.
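A minimal sketch of how the launch command could be assembled, assuming the server is started with the `vllm serve` CLI. The helper name `build_vllm_serve_argv` and the model ID `my-org/my-model` are hypothetical placeholders, and the flag is only accepted by vLLM versions that ship multi-step scheduling, so check `vllm serve --help` first.

```python
import shlex


def build_vllm_serve_argv(model: str, num_scheduler_steps: int = 8) -> list[str]:
    """Build an argv list for `vllm serve` with multi-step scheduling enabled.

    `model` is a placeholder model ID; `num_scheduler_steps` maps to the
    --num-scheduler-steps flag named in this report.
    """
    if num_scheduler_steps < 1:
        raise ValueError("num_scheduler_steps must be >= 1")
    return [
        "vllm", "serve", model,
        "--num-scheduler-steps", str(num_scheduler_steps),
    ]


# Hypothetical model ID, used only for illustration.
argv = build_vllm_serve_argv("my-org/my-model")
print(shlex.join(argv))
```

The list can be passed to `subprocess.run(argv)` as-is. Building it as a list rather than a single string avoids shell-quoting problems if the model path contains spaces.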
Journey Context:
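By default, vLLM runs one forward pass (one decoding step) per scheduler invocation. Each invocation crosses the Python/C++ boundary, prepares inputs, and launches a CUDA graph replay, so the fixed CPU cost is paid on every token. In single-stream or low-batch scenarios the GPU work per step is small, and the GPU can sit idle between steps waiting on the CPU, even when reported utilization looks high. Multi-step scheduling (highlighted in the v0.6.0 release, not v0.4.0) lets the scheduler pre-allocate for and run N steps at once, so the scheduling cost is paid once per N tokens. This is distinct from speculative decoding, which typically relies on a draft model, although both hide per-token latency. Users often mistake the feature for speculative decoding, or try --enforce-eager to reduce overhead. That flag disables CUDA graphs, which increases per-step launch overhead and makes the problem worse.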
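The amortization argument above can be sketched as a back-of-the-envelope model. It assumes that each scheduler invocation adds a fixed CPU overhead that does not overlap with GPU work, and the timings used (10 ms of GPU work per step, 4 ms of scheduler overhead) are hypothetical numbers, not measurements from vLLM.

```python
def per_token_latency_ms(gpu_ms: float, sched_overhead_ms: float, num_steps: int) -> float:
    """Average time per decoded token when the scheduler runs once every
    `num_steps` forward passes (simplified model: overhead is not overlapped)."""
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    return gpu_ms + sched_overhead_ms / num_steps


def throughput_gain(gpu_ms: float, sched_overhead_ms: float, num_steps: int) -> float:
    """Relative throughput gain over single-step scheduling (0.25 means +25%)."""
    baseline = per_token_latency_ms(gpu_ms, sched_overhead_ms, 1)
    multi = per_token_latency_ms(gpu_ms, sched_overhead_ms, num_steps)
    return baseline / multi - 1.0


# Hypothetical timings: 10 ms GPU work per step, 4 ms scheduler overhead.
for n in (1, 4, 8):
    print(n, round(throughput_gain(10.0, 4.0, n), 3))
```

Under these assumed timings the gain grows with N but flattens quickly, since the overhead share per token shrinks as 1/N. Real gains depend on the model, batch size, and hardware, so the step count should be benchmarked rather than assumed.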
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:28:42.249836+00:00— report_created — created