Report #70471
[tooling] llama.cpp server low throughput under concurrent client load
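Enable continuous batching with --cont-batching (or -cb) so the server can insert newly arrived sequences into the batch that is already decoding, instead of making them wait for the current batch to finish. This keeps GPU utilization high under concurrent load.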
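As a rough illustration of the fix, the sketch below assembles a launch command with continuous batching enabled. It is not taken from the report: the binary name llama-server (older builds ship it as server), the model path, and the helper build_server_command are assumptions, as is the idea that the total context given to -c is divided evenly among the -np slots.

```python
import shlex


def build_server_command(model_path, slots=4, ctx_per_slot=4096,
                         host="127.0.0.1", port=8080, binary="llama-server"):
    """Return an argv list that starts the server with continuous batching.

    Hypothetical helper for illustration; the binary name and the even
    split of context across slots are assumptions, not verified defaults.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return [
        binary,
        "-m", model_path,
        # Total context size; assumed here to be shared across all slots.
        "-c", str(slots * ctx_per_slot),
        # Number of parallel slots (sequences decoded side by side).
        "-np", str(slots),
        # Allow new requests to join a batch that is already decoding.
        "--cont-batching",
        "--host", host,
        "--port", str(port),
    ]


cmd = build_server_command("models/example.gguf")  # hypothetical model path
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.Popen. Confirm each flag against the --help output of your build before running.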
Journey Context:
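By default, the llama.cpp server processes batches sequentially or with static parallel slots (-np), which leaves the GPU idle between waves of requests. Continuous batching (--cont-batching) lets the server add newly arrived requests to the batch that is currently executing, mid-generation, so the GPU stays busy. Users often confuse -np (parallel slots) with continuous batching. The two are complementary: -np sets how many sequences can be decoded side by side, while -cb controls whether a free slot can be filled while the other slots are still generating. With -np alone, the slots are only used well when requests arrive at the same time; -cb is what handles asynchronous arrivals. For server deployments with staggered client traffic, check that continuous batching is on. Defaults differ between releases, so confirm with the server's --help output whether it is already enabled in your build.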
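The difference between static slots and continuous batching can be sketched with a toy scheduler. This is a simplified model of the behavior described above, not llama.cpp's actual scheduler: it assumes every decode step costs the same regardless of batch size, and the function simulate and its parameters are made up for illustration.

```python
def simulate(arrivals, lengths, slots, continuous):
    """Return the step at which each request finishes in a toy scheduler.

    arrivals[i] is the step at which request i arrives; lengths[i] is the
    number of decode steps it needs. One step advances every active
    request by one token.
    """
    pending = sorted(range(len(arrivals)), key=lambda i: arrivals[i])
    remaining = {}  # active request id -> decode steps left
    finish = [None] * len(arrivals)
    step = 0
    while pending or remaining:
        # Continuous mode fills free slots on every step. Static mode
        # admits a new wave only once the previous wave has drained.
        if continuous or not remaining:
            while (pending and len(remaining) < slots
                   and arrivals[pending[0]] <= step):
                i = pending.pop(0)
                remaining[i] = lengths[i]
        # One forward pass: every active request produces one token.
        for i in list(remaining):
            remaining[i] -= 1
            if remaining[i] <= 0:
                finish[i] = step + 1
                del remaining[i]
        step += 1
    return finish


# Four requests arriving one step apart, each needing 10 decode steps.
arrivals = [0, 1, 2, 3]
lengths = [10, 10, 10, 10]
static_finish = simulate(arrivals, lengths, slots=4, continuous=False)
cont_finish = simulate(arrivals, lengths, slots=4, continuous=True)
print("static:    ", static_finish)  # [10, 20, 20, 20]
print("continuous:", cont_finish)    # [10, 11, 12, 13]
```

In this toy run the static scheduler leaves three of the four slots empty for the first ten steps, because the later requests missed the first wave. The continuous scheduler admits each request on arrival, so the last one finishes at step 13 instead of step 20.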
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:52:11.167298+00:00— report_created — created