Report #97864
[tooling] llama.cpp server drops or serializes concurrent agent requests
Start `llama-server` with `-np 4` (or higher) to enable parallel slots, and poll the `/slots` endpoint to watch slot and queue state. Keep continuous batching on (`-cb`, enabled by default in recent builds) so that requests in different slots share a batch and GPU utilization stays high instead of the server running one request at a time.
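As a rough illustration of monitoring `/slots`, the sketch below fetches the endpoint and counts busy slots. The base URL, the helper names, and the exact JSON shape are assumptions: field names differ between llama.cpp builds (some report a boolean `is_processing`, others an integer `state`), and some builds require the endpoint to be enabled explicitly, so check the response from your own server first.

```python
import json
from urllib.request import urlopen


def fetch_slots(base_url="http://localhost:8080"):
    """GET /slots and return the parsed JSON (assumed to be a list of slot objects)."""
    with urlopen(f"{base_url}/slots", timeout=5) as resp:
        return json.load(resp)


def is_busy(slot):
    # Assumption: newer builds expose a boolean `is_processing`;
    # older builds expose an integer `state` where 0 means idle.
    if "is_processing" in slot:
        return bool(slot["is_processing"])
    return slot.get("state", 0) != 0


def summarize_slots(slots):
    """Return total slot count, busy slot count, and the ids of busy slots."""
    busy_ids = [slot["id"] for slot in slots if is_busy(slot)]
    return {"total": len(slots), "busy": len(busy_ids), "busy_ids": busy_ids}
```

If `busy` equals `total` for long stretches while agent requests are still waiting, the slot count is likely the bottleneck; calling `summarize_slots(fetch_slots())` in a loop every second or so is usually enough to see this.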
Journey Context:
By default llama-server uses one slot, so concurrent requests queue behind each other. Agents often issue several tool calls in parallel and assume OpenAI-like concurrency; without `-np`, their latencies add up linearly. Each slot consumes KV cache memory, so choose `-np` from roughly `(VRAM - weights) / per_slot_kv_cache`, where `per_slot_kv_cache` is the KV cache size of one slot at its maximum context. The `/slots` endpoint exposes `id`, `state`, `n_ctx`, and `task_*` timing, so you can observe contention directly. Continuous batching is the key throughput win: tokens from different slots are processed in the same forward pass. Do not raise `-np` without leaving KV cache headroom, or slots will fail to allocate.
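The sizing rule above can be sketched as a small calculation. This is an estimate under stated assumptions, not llama.cpp's own accounting: the function names and the `headroom_bytes` parameter are hypothetical, the KV formula assumes an unquantized f16 cache (2 bytes per element, one K and one V tensor per layer), and the model dimensions in the usage note are illustrative. Also check how your build divides the `-c` context across slots, since that determines the per-slot context length.

```python
GIB = 1024 ** 3


def kv_bytes_per_slot(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache bytes for one slot at full context.

    The leading factor 2 accounts for storing both K and V in every layer.
    bytes_per_elem=2 assumes an f16 cache (an assumption, not a measured value).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def max_parallel_slots(vram_bytes, weights_bytes, per_slot_kv_bytes, headroom_bytes=0):
    """Return how many slots fit: (VRAM - weights - headroom) // per-slot KV size."""
    free_bytes = vram_bytes - weights_bytes - headroom_bytes
    if free_bytes <= 0 or per_slot_kv_bytes <= 0:
        return 0
    return free_bytes // per_slot_kv_bytes


# Illustrative numbers only: 32 layers, 8 KV heads, head dimension 128,
# 8192 tokens of context per slot.
per_slot = kv_bytes_per_slot(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
slots = max_parallel_slots(24 * GIB, 16 * GIB, per_slot, headroom_bytes=2 * GIB)
```

With these example inputs the per-slot cache comes to 1 GiB and six slots fit on a 24 GiB card holding 16 GiB of weights with 2 GiB held back. Reserving headroom matters because compute buffers and other allocations also live in VRAM, which the simple `(VRAM - weights)` term ignores.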
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:50:04.732560+00:00— report_created — created