Report #43941
[tooling] Concurrent requests to llama.cpp server queue sequentially instead of processing in parallel
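Enable parallel request processing by starting the server with `--parallel 4` (short form `-np 4`, matching expected concurrency) and confirming that continuous batching (`-cb`, `--cont-batching`) is enabled; each slot handles one request independently, and all slots share the same model weights. Note that `--slots` does not set the slot count: in recent builds it only toggles the slot-monitoring endpoint, so check `llama-server --help` for your version.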
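A minimal sketch of assembling the launch command, assuming a build in which the total context given by `-c` is split evenly across the slots set by `-np`. The helper name `build_server_cmd`, the `model.gguf` path, and port 8080 are placeholders for illustration, not part of llama.cpp.

```python
def build_server_cmd(model_path, n_slots, ctx_per_slot, port=8080):
    """Build an argument list for llama-server with n_slots parallel slots.

    Assumption: the server divides the total context (-c) evenly across
    slots, so the total must be n_slots * ctx_per_slot for each slot to
    get the intended window.
    """
    total_ctx = n_slots * ctx_per_slot
    return [
        "llama-server",
        "-m", model_path,       # path to the GGUF model (placeholder)
        "-c", str(total_ctx),   # total context shared by all slots
        "-np", str(n_slots),    # number of parallel slots (--parallel)
        "--port", str(port),
    ]


if __name__ == "__main__":
    # 4 slots with a 4096-token window each
    print(" ".join(build_server_cmd("model.gguf", 4, 4096)))
```

The list form can be passed directly to `subprocess.Popen` if the server is launched from a script.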
Journey Context:
Users deploying llama.cpp as an OpenAI-compatible API often assume that multiple HTTP requests will parallelize automatically, as they do in vLLM. With a single slot, however, the llama.cpp server queues requests and serves them one at a time. The `--parallel` (`-np`) parameter creates independent slots (parallel decoding contexts) that share the same model weights in VRAM. Each incoming request takes a free slot, and with continuous batching enabled the server decodes tokens for all active slots together, so generation proceeds concurrently. The tradeoff is VRAM: each slot needs its own KV cache (context memory). For example, 4 slots at 4k context each require 4x the cache memory of 1 slot, and in builds that split the total context across slots this means passing `-c 16384` rather than `-c 4096`. Users often confuse slots with batch size: slots determine how many requests are served concurrently, while the batch size (`-b`) determines how many tokens are processed per step. The server README documents that slots enable parallel processing, but many users miss the distinction between `--parallel`, which sets the number of slots, and `--slots`, which in recent builds only controls the slot-monitoring endpoint.
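The VRAM cost can be estimated with the usual KV cache formula: two tensors (K and V) per layer, each holding one vector of `n_kv_heads * head_dim` elements per token. The sketch below is a rough estimate, not a measurement; the model shape (32 layers, 32 KV heads, head dimension 128, 16-bit cache) is an assumed 7B-class configuration chosen for illustration, and quantized caches or grouped-query attention would lower the figure.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size in bytes for one slot.

    The leading factor 2 accounts for the separate K and V tensors.
    bytes_per_elem=2 assumes a 16-bit (f16) cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# Assumed 7B-class shape, for illustration only
per_slot = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=4096)
four_slots = 4 * per_slot

print(f"1 slot  @ 4k context: {per_slot / GIB:.1f} GiB")
print(f"4 slots @ 4k context: {four_slots / GIB:.1f} GiB")
```

Under these assumptions one slot costs about 2 GiB and four slots about 8 GiB, on top of the model weights, which are loaded once and shared.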
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:13:40.159831+00:00— report_created — created