Report #68085
[tooling] llama.cpp server handling concurrent requests slowly or serially
Enable continuous batching with `--cont-batching` and set `--parallel N` (the number of slots) to the expected number of concurrent users. Use `--metrics` to verify slot utilization. Continuous batching lets the server add sequences to and remove them from a running batch without restarting the inference loop.
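One way to check slot utilization is to parse the Prometheus-style text that the server exposes when started with `--metrics`. The sketch below works on a hard-coded sample rather than a live server. The metric names `llamacpp:requests_processing` and `llamacpp:requests_deferred`, the sample values, and the helper names are assumptions not taken from this report, so verify them against the output of your own build.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple 'name value' lines from Prometheus text exposition output."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore anything that is not a plain 'name value' pair
    return metrics


def slot_utilization(metrics: dict[str, float], parallel: int,
                     busy_key: str = "llamacpp:requests_processing") -> float:
    """Fraction of the N slots currently busy (busy_key is an assumed metric name)."""
    return metrics.get(busy_key, 0.0) / parallel


# Hypothetical sample of what a metrics endpoint might return.
SAMPLE = """\
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 3
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 1
"""

sample_metrics = parse_metrics(SAMPLE)
utilization = slot_utilization(sample_metrics, parallel=4)
deferred = sample_metrics.get("llamacpp:requests_deferred", 0.0)
print(f"slots busy: {utilization:.0%}, deferred requests: {deferred:.0f}")
```

If the deferred count stays above zero while all slots are busy, requests are queuing and a higher `--parallel` value may help, memory permitting.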
Journey Context:
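By default, each llama.cpp server slot processes one sequence at a time, and without `--cont-batching` the server waits for the entire batch to finish before starting new sequences, which causes head-of-line blocking. With continuous batching (available since mid-2023), the server can add new sequences to a running batch and remove finished ones without restarting the inference loop, which can give roughly 3-5x higher throughput for mixed-length requests. The `--parallel` flag creates N slots, each with its own KV-cache space; with `--parallel 4`, the server can handle 4 concurrent users with minimal latency increase. Check how your build allocates context, because many versions divide the total context size evenly across the slots. Common error: setting `--parallel` too high for the available VRAM, which causes an out-of-memory (OOM) failure. Rule of thumb: (Model Size in GB / Quant compression) + (Parallel * Context Length * 0.002 GB) < VRAM, where Context Length is the per-slot context in tokens. The 0.002 GB per token figure is a rough estimate; the actual KV-cache cost depends on the model architecture and cache precision.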
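The rule of thumb above can be sketched as a small calculator. This is an illustration of the report's formula only, not a measurement: the function names are hypothetical, and the example figures (a 14 GB unquantized model, 4x quantization compression, a 2048-token per-slot context, 24 GB of VRAM) are assumptions chosen for the demonstration.

```python
def estimated_vram_gb(model_size_gb: float, quant_compression: float,
                      parallel: int, context_length: int,
                      gb_per_token: float = 0.002) -> float:
    """(Model Size / Quant compression) + (Parallel * Context Length * 0.002 GB)."""
    weights_gb = model_size_gb / quant_compression
    kv_cache_gb = parallel * context_length * gb_per_token
    return weights_gb + kv_cache_gb


def max_parallel(vram_gb: float, model_size_gb: float, quant_compression: float,
                 context_length: int, gb_per_token: float = 0.002) -> int:
    """Largest slot count whose estimate stays strictly below the VRAM budget."""
    if context_length <= 0 or gb_per_token <= 0:
        raise ValueError("context_length and gb_per_token must be positive")
    slots = 0
    # Keep adding slots while the next one still satisfies 'estimate < VRAM'.
    while estimated_vram_gb(model_size_gb, quant_compression,
                            slots + 1, context_length, gb_per_token) < vram_gb:
        slots += 1
    return slots


# Hypothetical example values, not taken from the report.
estimate = estimated_vram_gb(model_size_gb=14, quant_compression=4,
                             parallel=4, context_length=2048)
slots = max_parallel(vram_gb=24, model_size_gb=14, quant_compression=4,
                     context_length=2048)
print(f"estimated VRAM for 4 slots: {estimate:.2f} GB; max slots in 24 GB: {slots}")
```

A result of 0 from `max_parallel` means the weights alone leave no room for even one slot under this estimate, so a smaller quantization or shorter context would be needed before raising `--parallel`.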
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:45:56.816016+00:00— report_created — created