Report #50960
[tooling] llama.cpp server dropping concurrent requests or queuing sequentially instead of parallelizing
Enable continuous batching in the llama.cpp server with the -cb (--cont-batching) flag so that multiple requests are decoded together in the same batch. New requests can join mid-generation, and finished sequences leave immediately instead of waiting for the rest of the batch to complete.
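A minimal sketch of a launch command that combines the slot count with continuous batching. The binary name `llama-server`, the model path, the slot count, and the port are assumptions for illustration. Newer builds may already enable continuous batching by default, so check the `--help` output of your build before relying on this.

```python
import shlex


def build_server_command(model_path, slots=4, port=8080):
    """Build an argv list for the llama.cpp server (illustrative values only)."""
    return [
        "llama-server",             # assumed binary name; older builds ship it as "server"
        "-m", model_path,           # hypothetical model path supplied by the caller
        "--parallel", str(slots),   # number of slots (max concurrent sequences)
        "--cont-batching",          # long form of -cb
        "--port", str(port),
    ]


cmd = build_server_command("models/example.gguf")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.Popen(cmd)`; `shlex.join` is used here only to print a copy-pasteable command line.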
Journey Context:
Without continuous batching (naive dynamic batching), the server waits for all sequences in a batch to reach EOS before starting the next batch. This causes head-of-line blocking: short requests wait for long ones to finish. Continuous batching (also called in-flight batching or iteration-level scheduling) allows the server to: (1) add new requests to the current batch at any iteration, and (2) remove completed sequences at every iteration. This maximizes GPU utilization for mixed workloads (short and long generations). A common point of confusion is --parallel: it only sets the number of slots (the maximum number of concurrent sequences). Without -cb, those slots are not batched efficiently and requests are still largely serialized.
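The scheduling difference described above can be illustrated with a toy simulation. This is a sketch of the two policies, not llama.cpp's actual scheduler: it assumes one token per active sequence per iteration, ignores prompt processing, and the function names and request tuples are hypothetical.

```python
from collections import deque


def static_batching(requests, slots):
    """requests: list of (arrival_step, n_tokens >= 1). Returns finish step per request."""
    finish = [None] * len(requests)
    pending = deque(sorted(range(len(requests)), key=lambda i: requests[i][0]))
    step = 0
    while pending:
        # Idle until the next pending request has arrived.
        step = max(step, requests[pending[0]][0])
        batch = []
        while pending and len(batch) < slots and requests[pending[0]][0] <= step:
            batch.append(pending.popleft())
        for i in batch:
            finish[i] = step + requests[i][1]  # each sequence ends at its own EOS
        # The next batch cannot start until the longest sequence is done.
        step = max(finish[i] for i in batch)
    return finish


def continuous_batching(requests, slots):
    """Same inputs; slots are refilled at every iteration."""
    finish = [None] * len(requests)
    pending = deque(sorted(range(len(requests)), key=lambda i: requests[i][0]))
    active = {}  # request index -> tokens still to generate
    step = 0
    while pending or active:
        # (1) Admit arrived requests into any free slot.
        while pending and len(active) < slots and requests[pending[0]][0] <= step:
            i = pending.popleft()
            active[i] = requests[i][1]
        # One decode iteration: every active sequence emits one token.
        step += 1
        for i in list(active):
            active[i] -= 1
            if active[i] == 0:
                finish[i] = step  # (2) completed sequences leave immediately
                del active[i]
    return finish


# One long request, one short request, and a short request arriving one step later.
reqs = [(0, 100), (0, 5), (1, 5)]
print(static_batching(reqs, slots=2))      # [100, 5, 105]
print(continuous_batching(reqs, slots=2))  # [100, 5, 10]
```

In this toy workload the late short request finishes at step 10 with continuous batching, but at step 105 when it has to wait for the long sequence in the previous batch to reach EOS.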
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:01:07.820111+00:00— report_created — created