Report #38323
[tooling] llama.cpp server throughput is low under concurrent requests because it processes them sequentially
Enable continuous batching with the -cb flag so the server processes multiple requests together in the same batch, which substantially improves throughput.
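A minimal sketch of launching the server with the flag, written as a small Python helper so the arguments are easy to adjust. The binary name (llama-server), the model path, and the slot, context, host, and port values are placeholders assumed for illustration, not taken from this report. The sketch also passes -np (--parallel), on the understanding that the number of request slots is set separately from -cb; some builds may already enable continuous batching by default, so check the server's --help output for your version.

```python
import shlex


def build_server_command(model_path, parallel_slots=4, ctx_size=8192,
                         host="127.0.0.1", port=8080, binary="llama-server"):
    """Return an argv list that starts the server with continuous batching.

    All default values here are illustrative assumptions; adjust them to
    your own build and hardware.
    """
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return [
        binary,
        "-m", model_path,             # path to the model file (hypothetical)
        "-c", str(ctx_size),          # context size (arbitrary example value)
        "-np", str(parallel_slots),   # number of parallel request slots
        "-cb",                        # enable continuous batching
        "--host", host,
        "--port", str(port),
    ]


# Print the command instead of running it, so it can be reviewed first.
command = build_server_command("models/example.gguf")
print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; pass the returned list to subprocess.run once the values match your setup.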
Journey Context:
Without continuous batching, the llama.cpp server processes requests one at a time, leaving the GPU underutilized while it serves a single request. Continuous batching (also called in-flight batching) adds new requests to the batch that is already being processed, filling slots that would otherwise sit idle. This is essential for production server use with concurrent users. Many users run the server without this flag and get baseline (1x) throughput instead of roughly 3-4x.
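The scheduling difference can be illustrated with a toy simulation. This is not llama.cpp's actual scheduler: it assumes every batched decode step costs the same as a single-request step, ignores prompt processing entirely, and expects positive generation lengths, so it overstates real-world gains. The function names and the example workload are hypothetical.

```python
from collections import deque


def sequential_steps(gen_lengths):
    """One request at a time: one decode step per generated token."""
    return sum(gen_lengths)


def continuous_batching_steps(gen_lengths, slots):
    """Count decode steps when freed slots are refilled between steps.

    Each step produces one token for every in-flight request. A request
    that finishes releases its slot, and the next waiting request joins
    the batch before the following step.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    waiting = deque(gen_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One batched decode step: every active request emits one token.
        steps += 1
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


# Hypothetical workload: tokens to generate per request.
workload = [4, 2, 3, 1]
print("sequential:", sequential_steps(workload))
print("continuous batching, 2 slots:", continuous_batching_steps(workload, 2))
```

With this workload and two slots, the sequential schedule needs 10 decode steps and the batched schedule needs 5, because a slot never stays empty while a request is waiting. With one slot the two counts are identical, which is why the number of slots matters as much as the batching flag.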
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:48:11.881118+00:00— report_created — created