Report #14175
[tooling] Low throughput serving multiple concurrent requests with llama.cpp
Use llama-server with --parallel 4 --cont-batching to enable continuous batching. This gives 4 concurrent slots (raise --parallel for more) that share one loaded copy of the model weights while each keeps its own KV cache, which raises GPU utilization for API workloads.
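A minimal sketch of the launch command, built as an argument list so the slot count and context size stay in one place. The model path, port, and the helper name build_server_cmd are placeholders for illustration, not part of the report. It also assumes that -c sets the total context, which is divided across slots; this varies between llama.cpp builds, so check llama-server --help for yours.

```python
import shlex


def build_server_cmd(model_path, slots=4, ctx_per_slot=4096, port=8080):
    """Return an argv list that starts llama-server with `slots` parallel slots."""
    # Assumption: -c is the total context and is split evenly across slots,
    # so it is scaled by the slot count to keep ctx_per_slot tokens per request.
    total_ctx = slots * ctx_per_slot
    return [
        "llama-server",
        "-m", model_path,
        "--parallel", str(slots),
        "--cont-batching",
        "-c", str(total_ctx),
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Hypothetical model path; print the command instead of running it.
    print(shlex.join(build_server_cmd("models/model.gguf")))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.Popen unchanged.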
Journey Context:
Running separate llama.cpp instances for each request or processing sequentially wastes GPU capacity. The llama-server binary supports true parallel slots with continuous batching (dynamic batching of sequences), where multiple independent requests can be processed simultaneously on the same model weights. This is distinct from simple multi-threading; it manages separate KV caches per slot. Users often default to single-slot or use external load balancers inefficiently. The --cont-batching flag is crucial for throughput.
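To exercise the slots described above, a client has to keep several requests in flight at once. The sketch below sends prompts concurrently to the server's /completion endpoint from a thread pool. The URL, worker count, prompts, and the function names complete and complete_many are assumptions for illustration; match workers to the --parallel value used at launch.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed address; adjust to wherever llama-server is listening.
SERVER_URL = "http://127.0.0.1:8080/completion"


def complete(prompt, n_predict=64, url=SERVER_URL):
    """POST one prompt to llama-server and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["content"]


def complete_many(prompts, workers=4, send=complete):
    """Send prompts concurrently; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))


if __name__ == "__main__":
    # Hypothetical prompts; requires a running llama-server.
    prompts = [f"Write one sentence about topic {i}." for i in range(4)]
    for text in complete_many(prompts):
        print(text.strip())
```

Timing this against a loop that calls complete one prompt at a time is a simple way to check whether the parallel slots are actually being used.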
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:49:15.300047+00:00— report_created — created