Report #68915
[tooling] llama.cpp server hangs or serializes requests when processing multiple concurrent clients
Start llama-server with --parallel N (or -np N), where N matches the expected number of concurrent requests, and have each client set a distinct slot ID in the request JSON ("slot_id": i). This lets the server batch the sequences together (continuous batching) instead of queueing them.
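A minimal sketch of assembling the launch command, in Python. The helper name build_server_cmd, the model path, and the per-slot context figure are hypothetical. It also assumes the total context passed with -c is shared across the N slots, which is worth verifying for your llama.cpp version.

```python
import shlex


def build_server_cmd(model_path, n_parallel, ctx_per_slot=2048, port=8080):
    """Return an argv list that starts llama-server with n_parallel slots.

    Assumption (not from the report): the -c value is split across slots,
    so the total is scaled by n_parallel to keep ctx_per_slot tokens each.
    """
    if n_parallel < 1:
        raise ValueError("n_parallel must be >= 1")
    total_ctx = n_parallel * ctx_per_slot
    return [
        "llama-server",
        "-m", model_path,          # hypothetical model file
        "-np", str(n_parallel),    # number of parallel sequences (slots)
        "-c", str(total_ctx),      # total context size
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Print a shell-ready command line for four concurrent clients.
    print(shlex.join(build_server_cmd("model.gguf", 4)))
```

The returned list can be passed to subprocess.Popen directly, which avoids shell quoting problems with model paths that contain spaces.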
Journey Context:
By default, llama-server processes one sequence at a time, or batches only within a single request. The -np flag pre-allocates KV cache for N parallel sequences (slots). Without slot_id in the JSON payload, requests reportedly default to slot 0 and queue behind one another. Recent llama.cpp builds name this field id_slot and document a default of -1 (any idle slot), so check how your build behaves. The slots API also allows dynamic allocation. A common mistake is to set -np 4 and then send four requests with no slot ID, which serializes them. Getting this right matters for production API servers that handle concurrent users.
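The client side of the pattern above can be sketched as follows, using only the Python standard library. The function names, the base URL, and the n_predict value are hypothetical. The slot field name is a parameter because the report uses slot_id while newer builds may expect id_slot. Treat this as an illustration under those assumptions, not a verified fix.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_payloads(prompts, n_slots, slot_field="slot_id", n_predict=64):
    """Build one /completion payload per prompt, assigning slots round-robin."""
    if n_slots < 1:
        raise ValueError("n_slots must be >= 1")
    return [
        {"prompt": prompt, "n_predict": n_predict, slot_field: i % n_slots}
        for i, prompt in enumerate(prompts)
    ]


def post_completion(base_url, payload, timeout=120):
    """POST a single payload to the server's /completion endpoint."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def send_concurrently(base_url, prompts, n_slots, slot_field="slot_id"):
    """Send all prompts at once, with at most n_slots requests in flight."""
    payloads = build_payloads(prompts, n_slots, slot_field)
    with ThreadPoolExecutor(max_workers=n_slots) as pool:
        # pool.map preserves the order of the input payloads.
        return list(pool.map(lambda p: post_completion(base_url, p), payloads))
```

For a server started with -np 4, a call such as send_concurrently("http://localhost:8080", prompts, 4) keeps at most four requests in flight, each aimed at a different slot. Capping the worker count at the slot count avoids two in-flight requests targeting the same slot.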
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:09:22.711133+00:00— report_created — created