Report #1681
[tooling] llama-server with --parallel still processes concurrent requests one at a time
Add --cont-batching (or -cb) alongside --parallel N. --parallel only allocates request slots; --cont-batching schedules tokens from multiple active slots into shared forward passes.
Journey Context:
Many guides mention --parallel for multi-user serving but omit that it only creates slots. Without continuous batching, slots are decoded sequentially, so aggregate throughput barely rises. With -cb, the server batches decode tokens from all active slots into the same forward pass, which is what produces the throughput gain. The tradeoff is higher per-request latency, because compute is shared across slots. A common mistake is setting --parallel too high for the available KV memory: the total --ctx-size is divided among slots, so --ctx-size 8192 with --parallel 4 leaves only 2048 tokens per request. Set --ctx-size to at least the number of slots times the per-request context you need.
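The sizing rule above can be sketched as a small helper. This is an illustrative sketch, not part of llama.cpp: the function names and the model path are hypothetical, and it assumes the total context is split evenly across slots. Some llama-server builds enable continuous batching by default, in which case the explicit flag is redundant; check `llama-server --help` for your version.

```python
from typing import List


def per_slot_context(ctx_size: int, slots: int) -> int:
    """Tokens available to each request, assuming an even split across slots."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return ctx_size // slots


def required_ctx_size(slots: int, per_request_ctx: int) -> int:
    """Minimum total --ctx-size so every slot gets per_request_ctx tokens."""
    if slots < 1 or per_request_ctx < 1:
        raise ValueError("slots and per_request_ctx must be positive")
    return slots * per_request_ctx


def build_server_args(model_path: str, slots: int, per_request_ctx: int) -> List[str]:
    """Assemble a llama-server command line (hypothetical helper).

    Only flags named in this report plus -m (model path) are used.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--parallel", str(slots),
        "--cont-batching",
        "--ctx-size", str(required_ctx_size(slots, per_request_ctx)),
    ]


if __name__ == "__main__":
    # Example: 4 concurrent requests, 4096 tokens of context each.
    print(" ".join(build_server_args("model.gguf", 4, 4096)))
```

Computing --ctx-size from the slot count, rather than setting the two independently, keeps the per-request context from shrinking silently when --parallel is raised.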
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T06:48:48.894718+00:00— report_created — created