Report #36948
[tooling] vLLM offline inference throughput collapses when processing long input documents due to prefill phase blocking decode
Enable `--enable-chunked-prefill` in the vLLM engine arguments (`enable_chunked_prefill=True` when constructing the engine from Python) to split long prefill computations into smaller chunks that can be interleaved with decode iterations. This prevents head-of-line blocking and improves throughput by a reported 2-3x on long-document workloads.
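A minimal sketch of how this could look for offline inference through the Python API, assuming a vLLM version whose `LLM` constructor forwards `enable_chunked_prefill` and `max_num_batched_tokens` to the engine arguments. The helper names `build_engine_kwargs` and `run_offline` are hypothetical, and the chunk budget of 512 is taken from this report rather than being a recommended value.

```python
# Sketch only: helper names are hypothetical, not part of vLLM.

def build_engine_kwargs(model: str, chunk_tokens: int = 512) -> dict:
    """Return keyword arguments for vllm.LLM with chunked prefill enabled."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return {
        "model": model,
        # Python-API counterpart of the --enable-chunked-prefill CLI flag.
        "enable_chunked_prefill": True,
        # Per-step token budget; with chunked prefill enabled this bounds
        # how many prompt tokens are prefilled in a single engine step.
        "max_num_batched_tokens": chunk_tokens,
    }


def run_offline(prompts, model, max_tokens=128):
    """Generate completions for a batch of prompts (requires vLLM and a GPU)."""
    # Imported lazily so build_engine_kwargs works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs(model))
    outputs = llm.generate(prompts, SamplingParams(max_tokens=max_tokens))
    return [o.outputs[0].text for o in outputs]
```

Pass whatever model identifier is already in use as `model`. Depending on the vLLM version, `max_num_batched_tokens` may need to be at least as large as `max_num_seqs`, so check the engine's startup validation before lowering it further.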
Journey Context:
In standard vLLM, the prefill phase (processing the input prompt) runs as a single large batch operation that monopolizes the GPU, blocking all decode iterations (token generation) for other sequences until it completes. For long documents (4K+ tokens), this creates head-of-line blocking: the GPU stays busy with one prompt while every other request makes no progress for hundreds of milliseconds. Chunked prefill breaks the prefill computation into fixed-size chunks (512 tokens by default per this report; the size is governed by `max_num_batched_tokens` and may differ by vLLM version) that are scheduled alongside decode iterations using vLLM's continuous batching. The tradeoff is slightly higher scheduling overhead and potentially slower time-to-first-token for individual requests, but aggregate throughput increases substantially (a reported 2-3x) for offline batch processing of long documents. This is distinct from speculative decoding: it addresses the scheduling bottleneck rather than the token acceptance rate.
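The scheduling difference described above can be illustrated with a toy model. This is a simplified sketch and not vLLM's actual scheduler: it assumes a single long prompt, a fixed number of sequences already decoding, decode tokens being scheduled first in each step, and step cost proportional to the number of tokens processed. The function name `schedule_steps` is hypothetical.

```python
# Toy model of prefill scheduling; not vLLM's real scheduler.

def schedule_steps(prompt_tokens, num_decoding, token_budget, chunked):
    """Return a list of (prefill_tokens, decode_tokens) per engine step
    until the whole prompt has been prefilled."""
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        if chunked:
            # Decodes get one token each, then prefill fills the leftover budget.
            decode = min(num_decoding, token_budget)
            prefill = min(remaining, token_budget - decode)
            if prefill == 0:
                raise ValueError("token_budget must exceed num_decoding")
        else:
            # Whole prompt in one step; decoding sequences wait.
            decode = 0
            prefill = remaining
        steps.append((prefill, decode))
        remaining -= prefill
    return steps


if __name__ == "__main__":
    for mode in (False, True):
        steps = schedule_steps(4096, 8, 512, chunked=mode)
        largest = max(p + d for p, d in steps)
        stalled = sum(1 for _, d in steps if d == 0)
        print(f"chunked={mode}: {len(steps)} steps, "
              f"largest step {largest} tokens, {stalled} steps without decode")
```

In this model the unchunked schedule has one 4096-token step in which no decoding sequence advances, while the chunked schedule caps every step at the 512-token budget and lets all eight decoding sequences emit a token in each step.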
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:29:37.034104+00:00— report_created — created