Report #15958
[tooling] llama.cpp server handling concurrent requests sequentially instead of in parallel, causing high latency under load
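Start the server with -np N (--parallel N, where N > 1) so that N slots are available, and make sure continuous batching is enabled (-cb / --cont-batching, which is on by default in recent builds). The server then uses continuous batching (in-flight batching) to process decode steps from concurrent requests in a single forward pass. Note that --slots does not take a count: in current llama.cpp builds it only toggles the /slots monitoring endpoint.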
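A minimal sketch of assembling such a launch command, assuming the binary is named llama-server and is on PATH. The helper name build_server_command, the model path model.gguf, and the choice to size -c as slots times per-slot context are illustrative assumptions, not part of the report; check the flags against `llama-server --help` for your build.

```python
import shlex


def build_server_command(model_path, n_slots, ctx_per_slot, binary="llama-server"):
    """Return an argv list that launches the server with n_slots parallel slots.

    Hypothetical helper: assumes the context passed with -c is shared across
    slots, so the total is sized as n_slots * ctx_per_slot.
    """
    if n_slots < 1:
        raise ValueError("n_slots must be at least 1")
    total_ctx = n_slots * ctx_per_slot
    return [
        binary,
        "-m", model_path,        # model file (placeholder path)
        "-np", str(n_slots),     # number of parallel sequences (slots)
        "-c", str(total_ctx),    # total context, shared by all slots
        "-cb",                   # continuous batching (default in recent builds)
    ]


cmd = build_server_command("model.gguf", n_slots=4, ctx_per_slot=2048)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.Popen; the printed string is only for inspection.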
Journey Context:
With a single slot, the llama.cpp server queues concurrent requests and serves them one at a time. With continuous batching (also called in-flight batching or iteration-level scheduling), the server combines decode steps from multiple unrelated sequences into a single forward pass, provided a slot is free for each sequence and the batch size accommodates them. Because token generation is usually limited by memory bandwidth rather than compute, 4 concurrent requests can finish in far less than 4x the time of one request, although each forward pass does get somewhat slower as the batch grows. Critical configuration: -np (or --parallel) sets the number of slots, that is, how many sequences the backend can process simultaneously. --slots does not set a count; it only controls the /slots monitoring endpoint. The context size given with -c is typically shared across slots, so raise it when raising -np. Common error: leaving -np at 1 (or passing only --slots), so requests are queued but never batched.
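The scheduling difference can be illustrated with a toy model that only counts decode forward passes. This is a sketch under simplifying assumptions, not the server's actual scheduler: prompt processing is ignored, every pass is treated as equal cost regardless of batch size, and the function names are hypothetical.

```python
def forward_passes_sequential(tokens_per_request):
    """One request at a time: every generated token needs its own pass."""
    return sum(tokens_per_request)


def forward_passes_continuous(tokens_per_request, n_slots):
    """Count passes when up to n_slots sequences share each forward pass."""
    if n_slots < 1:
        raise ValueError("n_slots must be at least 1")
    pending = [t for t in tokens_per_request if t > 0]  # waiting for a slot
    active = []                                         # tokens left per busy slot
    passes = 0
    while pending or active:
        # Free slots are refilled before every iteration, so a new request
        # joins as soon as another one finishes.
        while pending and len(active) < n_slots:
            active.append(pending.pop(0))
        passes += 1
        # Each active sequence produces one token in this pass; finished
        # sequences release their slot.
        active = [t - 1 for t in active if t - 1 > 0]
    return passes


requests = [100, 100, 100, 100]  # tokens to generate per request
print(forward_passes_sequential(requests))     # 400
print(forward_passes_continuous(requests, 4))  # 100
print(forward_passes_continuous(requests, 2))  # 200
```

In this model, four equal requests on four slots need as many passes as one request alone. Real speedups are smaller because each batched pass costs somewhat more than a single-sequence pass.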
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:25:31.948189+00:00— report_created — created