Report #54947
[tooling] vLLM server has low throughput with agentic tool-calling (alternating long prompts and short completions)
Enable `--enable-chunked-prefill` so that prefill and decode work can be interleaved in the same batch, reducing GPU idle time during long-context processing.
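As a rough illustration of how the flag might be passed, the sketch below assembles a `vllm serve` command line in Python. The helper name `build_vllm_command` and the model name `your-org/your-model` are hypothetical placeholders. The optional `--max-num-batched-tokens` value is an assumed tuning knob, not something this report specifies. Check the flags against your installed vLLM version before running.

```python
import shlex


def build_vllm_command(model, max_num_batched_tokens=None):
    """Return an argv list that launches a vLLM server with chunked prefill.

    `model` is whatever model identifier you serve (placeholder in this sketch).
    `max_num_batched_tokens` is optional and only an assumed tuning value.
    """
    cmd = ["vllm", "serve", model, "--enable-chunked-prefill"]
    if max_num_batched_tokens is not None:
        # Per-step token budget shared by prefill chunks and decode tokens.
        cmd += ["--max-num-batched-tokens", str(max_num_batched_tokens)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_vllm_command("your-org/your-model", 2048)))
```

Printing the command rather than executing it keeps the sketch in line with the warning in this report: verify before running.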
Journey Context:
Without chunked prefill, vLLM schedules prefill (prompt processing) ahead of decode (token generation) and does not mix the two in one batch. Agents typically process long tool results (long prefill) and then generate short tool calls (short decode), so in-flight decodes stall while each long prompt is processed, and the GPU is underutilized. Chunked prefill breaks long prompts into chunks that can be batched together with decode requests, which can substantially improve GPU utilization for this workload.
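The scheduling difference described above can be sketched with a toy model. This is not vLLM's actual scheduler: the function `schedule`, the per-step token `budget`, and the numbers used are all assumptions chosen for illustration. Each step is recorded as a `(prefill_tokens, decode_tokens)` pair.

```python
def schedule(prompt_len, num_decodes, budget, chunked):
    """Toy batch plan for one long prompt arriving while decodes are running.

    Returns a list of (prefill_tokens, decode_tokens) tuples, one per step.
    Hypothetical model only; real schedulers track many more constraints.
    """
    if chunked and budget <= num_decodes:
        raise ValueError("budget must leave room for at least one prefill token")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        if chunked:
            # Decodes take one token each; the prefill chunk fills the rest.
            chunk = min(remaining, budget - num_decodes)
            steps.append((chunk, num_decodes))
        else:
            # Prefill-only step: the whole prompt runs while decodes wait.
            chunk = remaining
            steps.append((chunk, 0))
        remaining -= chunk
    return steps


if __name__ == "__main__":
    for mode in (False, True):
        print("chunked" if mode else "unchunked", schedule(8000, 4, 2048, mode))
```

In this toy model the unchunked plan has a single step in which no decode request advances, while the chunked plan advances every running decode on every step and keeps each step within the token budget.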
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:43:19.747221+00:00— report_created — created