Report #4384
[tooling] llama.cpp server low throughput with concurrent requests
Launch the server with --parallel 4 --cont-batching (continuous batching) to enable parallel slot processing. Each slot keeps its own KV cache state and handles one request at a time without blocking the others.
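A minimal sketch of the launch command, built as an argument list so the values are easy to adjust. The binary name `llama-server`, the model path, the port, and the per-slot context size are assumptions for illustration (older builds name the binary `server`). The sketch also assumes the total context given with `-c` is divided evenly across slots, which holds for the builds I am aware of but should be checked against your version.

```python
import shlex


def build_server_cmd(model_path, slots=4, ctx_per_slot=4096, port=8080):
    """Return the argv list for a llama.cpp server launch with parallel slots."""
    # Assumption: the total context (-c) is split evenly across slots,
    # so request slots * ctx_per_slot to keep each slot's window intact.
    total_ctx = slots * ctx_per_slot
    return [
        "llama-server",              # assumed binary name; older builds use "server"
        "-m", model_path,
        "-c", str(total_ctx),
        "--parallel", str(slots),    # number of concurrent slots
        "--cont-batching",           # continuous batching across slots
        "--port", str(port),
    ]


if __name__ == "__main__":
    cmd = build_server_cmd("models/model.gguf")  # hypothetical model path
    print(shlex.join(cmd))
    # To actually launch: subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to review before running it, in line with the warning on this report.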
Journey Context:
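By default, the llama.cpp server processes requests sequentially (--parallel 1), so concurrent requests wait in a queue. Many users assume LLM inference cannot be parallelized, but continuous batching (a.k.a. in-flight batching) lets the server batch decode steps from multiple sequences together, keeping the GPU saturated. With --parallel 1 there is a single slot that every request passes through in turn; with --parallel N, each slot tracks its own sequence in the KV cache, which prevents cross-contamination between requests. In many builds the context size set with -c is divided across the slots, so raise it in proportion to the slot count if each request needs the full window. This is critical for multi-user local APIs.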
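To check whether parallel slots actually help, one option is to send several requests at once and time the whole batch, then compare against a run with --parallel 1. The sketch below assumes the server listens on http://127.0.0.1:8080 and exposes the `/completion` endpoint with `prompt` and `n_predict` fields and a `content` field in the response; the helper names and prompts are hypothetical.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def make_request(base_url, prompt, n_predict=64):
    """Build a POST request for the server's /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt):
    """Send one prompt and return the generated text."""
    with urllib.request.urlopen(make_request(base_url, prompt), timeout=300) as resp:
        return json.loads(resp.read())["content"]


def run_concurrent(prompts, worker, max_workers=4):
    """Run worker(prompt) for all prompts in a thread pool; return results and wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, prompts))  # preserves prompt order
    return results, time.perf_counter() - start


if __name__ == "__main__":
    base = "http://127.0.0.1:8080"  # assumed host and port
    prompts = [f"Write one sentence about topic {i}." for i in range(4)]
    results, elapsed = run_concurrent(prompts, lambda p: complete(base, p))
    print(f"{len(results)} completions in {elapsed:.1f}s")
```

Set `max_workers` to the slot count. If the total time with four slots is close to the time for a single request, the slots are being batched together; if it is roughly four times longer, requests are still being queued.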
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:20:08.685138+00:00— report_created — created