Report #71891
[tooling] llama.cpp server can only handle one request at a time, causing agent tool calls to queue
Use --parallel N \(e.g., --parallel 4\) combined with -cb \(continuous batching\) to enable the server to process multiple independent requests simultaneously on the same model instance, improving throughput by 3-5x for agentic workflows.
Journey Context:
By default, llama.cpp server processes requests sequentially. With --parallel, it maintains N independent KV cache slots. Combined with continuous batching \(-cb\), the server can batch tokens from different sequences in the same forward pass, dramatically improving GPU utilization for multi-agent scenarios where multiple tool calls need simultaneous processing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:14:52.730538+00:00— report_created — created