Report #24762
[tooling] vLLM OOM or poor throughput when processing long documents (8k+ tokens) in offline batches
Enable --enable-chunked-prefill and set --max-num-batched-tokens to slightly above your longest context (e.g., 8192 for 8k docs) while keeping --max-num-seqs low (8-16). This splits prefill into chunks that are batched together with decode steps, which avoids the memory spike of running attention over the full context in a single prefill pass.
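For offline batches, the same three settings can be passed as keyword arguments to vLLM's Python `LLM` class instead of CLI flags. The sketch below is a minimal illustration, not a tuned configuration: the helper names `chunked_prefill_engine_args` and `run_offline_batch`, the default of 8 sequences, and the `max_tokens=256` sampling value are assumptions made for the example, and the values should be checked against your own model and GPU.

```python
def chunked_prefill_engine_args(longest_context, max_num_seqs=8):
    """Build keyword arguments for vllm.LLM that mirror the flags in this report.

    Hypothetical helper: the values follow the report's advice and are a
    starting point to tune, not a guaranteed-safe configuration.
    """
    if longest_context <= 0:
        raise ValueError("longest_context must be positive")
    return {
        "enable_chunked_prefill": True,             # --enable-chunked-prefill
        "max_num_batched_tokens": longest_context,  # --max-num-batched-tokens
        "max_num_seqs": max_num_seqs,               # --max-num-seqs
    }


def run_offline_batch(model_name, prompts, longest_context=8192):
    """Generate completions for a list of prompts (requires vLLM and a GPU)."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name, **chunked_prefill_engine_args(longest_context))
    # max_tokens=256 is an arbitrary value chosen for this example.
    outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
    return [out.outputs[0].text for out in outputs]
```

If OOM persists, lowering `max_num_batched_tokens` or `max_num_seqs` is the first thing to try.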
Journey Context:
vLLM's default scheduling processes prefill (prompt processing) in one shot over the full sequence length. For long contexts, this creates very large intermediate tensors (such as the QK^T attention matrices) that can cause OOM even when the model weights fit. A common first fix is to reduce the batch size, which kills throughput. The key insight is that prefill can be chunked (the prompt is processed iteratively) and interleaved with decode steps. With --enable-chunked-prefill, the scheduler can batch a 512-token chunk of a long prefill alongside 8 other sequences that are in their decode phase, keeping the GPU busy without the memory cliff. Tuning --max-num-batched-tokens is critical: too high reintroduces OOM, and too low fragments batches.
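The interleaving described above can be illustrated with a toy model of the per-step token budget. This is not vLLM's actual scheduler: the function `plan_steps` is hypothetical, and it assumes each decoding sequence costs exactly one token per step, the same sequences keep decoding throughout, and all leftover budget goes to the long prompt.

```python
def plan_steps(prompt_len, num_decoding, token_budget):
    """Return the prefill chunk size scheduled at each step (toy model).

    Each decoding sequence consumes one token of the budget per step;
    whatever remains is spent on the next chunk of the long prompt.
    """
    if token_budget <= num_decoding:
        raise ValueError("token_budget must exceed the number of decoding sequences")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# An 8192-token prompt, 8 sequences decoding, and a budget of 520 tokens per step:
chunks = plan_steps(prompt_len=8192, num_decoding=8, token_budget=520)
print(len(chunks), chunks[0])  # 16 512
```

In this model the budget bounds how many tokens are processed in any one step, which is why raising it too far brings the memory spike back, while a very small budget stretches the prefill over many small steps.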
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:58:29.742689+00:00— report_created — created