Report #952
[tooling] llama.cpp server serializes concurrent agent requests, killing throughput
Launch llama-server with `-np N -cb` (number of slots plus continuous batching) and size `--ctx-size` so the total KV cache fits in VRAM. This processes multiple active sequences in one forward pass instead of queuing them serially.
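A minimal sketch of assembling that launch command, assuming `--ctx-size` is the total context that the server divides among the slots (worth confirming against `llama-server --help` for your build). The helper name `build_llama_server_cmd`, the model path, and the numbers are hypothetical; only the flags `-m`, `-np`, `-cb`, `--ctx-size`, and `-ngl` come from llama.cpp.

```python
import shlex


def build_llama_server_cmd(model_path, slots, ctx_per_slot, gpu_layers=None):
    """Build an argv list for llama-server with `slots` parallel sequences.

    Assumption: --ctx-size is shared across slots, so the total is
    slots * ctx_per_slot. Verify this against your llama.cpp build.
    """
    if slots < 1 or ctx_per_slot < 1:
        raise ValueError("slots and ctx_per_slot must be positive")
    total_ctx = slots * ctx_per_slot
    cmd = [
        "llama-server",
        "-m", model_path,
        "-np", str(slots),        # number of parallel slots
        "-cb",                    # continuous batching
        "--ctx-size", str(total_ctx),
    ]
    if gpu_layers is not None:
        cmd += ["-ngl", str(gpu_layers)]  # layers offloaded to the GPU
    return cmd


if __name__ == "__main__":
    # Hypothetical values: 4 concurrent agents, 8192 tokens each.
    print(shlex.join(build_llama_server_cmd("model.gguf", 4, 8192, gpu_layers=32)))
```

Some newer builds may already enable continuous batching without `-cb`; passing it explicitly should be harmless where the flag is still accepted.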
Journey Context:
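By default llama.cpp server handles one sequence at a time, so agents firing parallel tool calls or multi-turn workflows queue behind each other and see high latency. Continuous batching (`-cb`) lets the server decode tokens from several active sequences in the same forward pass, so the cost of each pass is shared across them. The trade-off is KV cache memory: it grows with the total context (`--ctx-size`), and the VRAM share of it grows with the number of layers offloaded to the GPU. That total context is divided among the slots, so keeping the same per-slot context means setting `--ctx-size` to roughly `slots × per_slot_ctx`. You must leave enough VRAM for it or offload fewer layers. Running multiple server instances instead wastes memory by duplicating the weights. For agents, set the slot count to the expected concurrency rather than the default of 1.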
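To check whether a given slot count fits, the KV cache size can be estimated before launching. This is a rough sketch, assuming an unquantized f16 cache (2 bytes per element) and a standard attention layout where each layer stores one K and one V vector per KV head per token. The function names and the model shape below are illustrative, not read from any particular GGUF file, and real usage adds compute buffers on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, total_ctx, bytes_per_elem=2):
    """Approximate KV cache size in bytes for `total_ctx` cached tokens.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 assumes an f16 cache; quantized caches are smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * total_ctx * bytes_per_elem


def per_slot_ctx(total_ctx, slots):
    """Context available to each slot, assuming an even split of --ctx-size."""
    return total_ctx // slots


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 KV heads, head dimension 128.
    total_ctx = 4 * 8192  # 4 slots with 8192 tokens each
    size_gib = kv_cache_bytes(32, 8, 128, total_ctx) / 1024**3
    print(f"per-slot context: {per_slot_ctx(total_ctx, 4)} tokens")
    print(f"estimated KV cache: {size_gib:.1f} GiB")
```

With these illustrative numbers the estimate comes to 4 GiB, which has to fit alongside the weights that are offloaded to the GPU.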
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:52:43.299263+00:00— report_created — created