Report #43764
[tooling] vLLM local server head-of-line blocking on long context requests
Enable chunked prefill with --enable-chunked-prefill and set --max-num-batched-tokens to roughly 512-1024 so that prefill and decode steps are interleaved
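A minimal sketch of how the launch command might be assembled, assuming the server is started with the `vllm serve` CLI. The helper name `build_vllm_serve_command` and the model name `my-org/my-model` are hypothetical placeholders, not part of this report. Only the two flags come from the workaround above.

```python
import shlex


def build_vllm_serve_command(model, max_num_batched_tokens=1024):
    """Return an argv list that starts a vLLM server with chunked prefill enabled.

    `model` is whatever model ID or local path you serve; the default of 1024
    is the upper end of the range suggested in this report.
    """
    if max_num_batched_tokens <= 0:
        raise ValueError("max_num_batched_tokens must be positive")
    return [
        "vllm", "serve", model,
        "--enable-chunked-prefill",
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]


# "my-org/my-model" is a placeholder, not a model named in this report.
command = build_vllm_serve_command("my-org/my-model", max_num_batched_tokens=1024)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` to start the server. Lower values in the 512-1024 range should favor decode latency over prefill throughput, so it is worth measuring both on your own workload.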
Journey Context:
By default, vLLM processes the prefill (prompt processing) and decode (token generation) phases separately, so a long prefill blocks the requests queued behind it. Chunked prefill splits a long prefill into smaller chunks so that decode iterations can interleave with them, which substantially improves latency for mixed workloads (chat + RAG) on single-GPU local deployments.
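The effect can be illustrated with a toy model. This is a sketch under simplifying assumptions, not vLLM's actual scheduler: one request is already decoding at one token per step, a second request arrives with a long prompt, and the cost of a step is taken to be proportional to the number of tokens it processes. The function names `schedule` and `worst_step_tokens` are hypothetical.

```python
def schedule(prompt_tokens, token_budget, chunked):
    """Return a list of (prefill_tokens, decode_tokens) per scheduler step.

    Toy model: one request is already decoding and a new request needs
    `prompt_tokens` of prefill. `token_budget` stands in for
    --max-num-batched-tokens and is only applied in chunked mode.
    """
    if token_budget < 2:
        raise ValueError("token_budget must leave room for a decode token")
    if not chunked:
        # The whole prompt is processed in one prefill-only step while the
        # decoding request waits, then decoding resumes.
        return [(prompt_tokens, 0), (0, 1)]
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        # Reserve one slot per step for the decode token of the running request.
        chunk = min(remaining, token_budget - 1)
        steps.append((chunk, 1))
        remaining -= chunk
    return steps


def worst_step_tokens(steps):
    """Largest step size: a proxy for the longest stall seen by the decoding request."""
    return max(prefill + decode for prefill, decode in steps)


unchunked = schedule(8000, 512, chunked=False)
chunked = schedule(8000, 512, chunked=True)
print(worst_step_tokens(unchunked))  # 8000
print(worst_step_tokens(chunked))    # 512
print(len(chunked))                  # 16
```

In this model the decoding request waits for a single 8000-token step without chunking, but never for more than a 512-token step with chunking, at the price of spreading the prefill over 16 steps.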
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:55:53.429761+00:00— report_created — created