Report #38738
[tooling] llama.cpp server has poor throughput with concurrent requests despite batching
Enable --cont-batching (continuous batching) together with -np (number of parallel sequence slots) set above 1, so the server can admit new requests mid-generation and keep GPU utilization high.
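A minimal sketch of the launch command this fix implies, written as a small Python helper. The function name `build_server_command`, the binary name `llama-server`, and the model path are illustrative assumptions and do not come from the report. Only the -np and --cont-batching flags are taken from the text above, with -m used to select the model. Check the --help output of your build, since binary names, flags, and defaults can differ between llama.cpp versions.

```python
import shlex


def build_server_command(model_path, parallel_slots=4, binary="llama-server"):
    """Build an argv list that enables parallel slots plus continuous batching.

    `binary` and `model_path` are placeholders (assumptions), not values
    taken from the report.
    """
    if parallel_slots < 2:
        # The fix only matters when more than one sequence slot is configured.
        raise ValueError("parallel_slots must be greater than 1")
    return [
        binary,
        "-m", model_path,            # model file to load
        "-np", str(parallel_slots),  # number of parallel sequence slots
        "--cont-batching",           # admit new requests as slots free up
    ]


cmd = build_server_command("models/example.gguf")
print(shlex.join(cmd))
# prints: llama-server -m models/example.gguf -np 4 --cont-batching
```

The returned list can be passed to `subprocess.Popen` to start the server once the placeholders are replaced with real values.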
Journey Context:
Without continuous batching, the llama.cpp server runs one batch to completion before admitting new work; if one request generates 100 tokens and another generates 10, the slot used by the 10-token request sits idle until the 100-token request finishes. Continuous batching lets the server place a new request into a slot as soon as it frees, keeping the GPU busy instead of waiting for the whole batch to drain. The -np flag sets the number of parallel sequence slots (VRAM permitting). Users often set -np > 1 but omit --cont-batching, getting static batching instead of continuous batching, which can roughly halve throughput under variable load.
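The scheduling difference described above can be illustrated with a toy model. This is a sketch under stated assumptions, not llama.cpp's actual scheduler: every decode step is assumed to cost the same regardless of how many slots are occupied, prompt processing is ignored, and all requests are queued up front. The function names and the request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when requests run in fixed groups of `slots`.

    Each group occupies the server until its longest request finishes.
    """
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled from the queue immediately."""
    queue = deque(lengths)
    active = []  # remaining tokens for each occupied slot
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One decode step: every active sequence emits one token.
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: token counts per request, in arrival order.
lengths = [100, 10, 100, 10]
print(static_batching_steps(lengths, slots=2))      # prints: 200
print(continuous_batching_steps(lengths, slots=2))  # prints: 110
```

In this toy workload the static schedule needs 200 steps against 110 for the continuous one, which is the kind of gap that variable request lengths can produce. Real throughput depends on the model, hardware, and request mix.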
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:29:59.173461+00:00— report_created — created