Report #92718
[tooling] vLLM throughput collapses, or the server OOMs, with 70B models on a single GPU under default settings
Enable chunked prefill with `--enable-chunked-prefill` and set `--max-num-batched-tokens` explicitly, to a value slightly larger than the model's max context (e.g., 4096 or 8192), rather than leaving it at the default. This interleaves prefill chunks with decode steps, preventing the head-of-line blocking that kills throughput in default vLLM for large models on single GPUs.
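A minimal sketch of the launch command this fix describes, assembled in Python so the flags can be checked before running. It assumes the `vllm serve` entrypoint is available in your installed version; the helper name `build_vllm_serve_command` and the model ID are hypothetical placeholders, and 8192 is just the example value from the text above.

```python
import shlex
from typing import List


def build_vllm_serve_command(model: str, max_num_batched_tokens: int) -> List[str]:
    """Build an argv list for a vLLM server with chunked prefill enabled.

    Hypothetical helper for illustration; it only assembles the two flags
    named in this report and does not start the server.
    """
    if max_num_batched_tokens <= 0:
        raise ValueError("max_num_batched_tokens must be positive")
    return [
        "vllm", "serve", model,  # assumes the `vllm serve` entrypoint exists
        "--enable-chunked-prefill",
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]


# Placeholder model ID; substitute the 70B checkpoint you actually serve.
cmd = build_vllm_serve_command("your-org/your-70b-model", 8192)
print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; pass the list to `subprocess.run` only after confirming the flags against `vllm serve --help` for your version.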
Journey Context:
By default, vLLM schedules each prefill (prompt processing) as a monolithic block, and no decode tokens are generated while it runs. For 70B models on a single GPU, a long prefill can take seconds, so running requests appear to freeze and inter-token latency becomes very poor. Users often misattribute this to model slowness or memory issues. Chunked prefill splits the prefill computation into smaller chunks that are interleaved with decode steps. The key is tuning `--max-num-batched-tokens`, which caps the tokens processed per scheduler step and therefore the chunk size, to match your GPU's balance of memory bandwidth and compute: too large and decodes are blocked again, too small and you lose efficiency. This is essential for serving 70B models interactively with vLLM on a single A100/H100 or a high-end consumer GPU.
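The trade-off can be illustrated with a toy model. This is not vLLM's scheduler, only a sketch under stated assumptions: each running decode request contributes one token per step, step time grows roughly with the number of tokens in the step, and the function name `step_sizes` is hypothetical.

```python
from typing import List


def step_sizes(prompt_tokens: int, decode_requests: int,
               token_budget: int, chunked: bool) -> List[int]:
    """Tokens processed per scheduler step until one prompt is fully prefilled.

    Toy model only: a larger step means a longer stall for every decode
    request waiting on that step.
    """
    if not chunked:
        # Monolithic prefill: the whole prompt runs in one step, decodes wait.
        return [prompt_tokens]
    chunk_room = token_budget - decode_requests  # budget left after decodes
    if chunk_room <= 0:
        raise ValueError("token_budget must exceed the number of decode requests")
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, chunk_room)
        steps.append(chunk + decode_requests)  # prefill chunk plus decode tokens
        remaining -= chunk
    return steps


monolithic = step_sizes(8000, 8, 2048, chunked=False)
chunked = step_sizes(8000, 8, 2048, chunked=True)
print("largest step without chunking:", max(monolithic))
print("largest step with chunking:", max(chunked), "over", len(chunked), "steps")
```

In this model the largest step, a stand-in for the worst decode stall, drops from the full prompt length to the token budget, at the cost of more steps. That mirrors the tuning advice: a bigger budget means fewer but longer steps, a smaller one means shorter stalls but more per-step overhead.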
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:12:54.665032+00:00— report_created — created