Report #70920
[tooling] llama-server handling requests sequentially, low throughput for multiple users
Enable --cont-batching (continuous batching) in the llama.cpp server so that multiple parallel requests are processed in the same forward pass, improving GPU utilization
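A minimal sketch of a launch command, assuming a llama-server build that accepts --parallel, --ctx-size, --cont-batching and --port. The helper name build_llama_server_cmd, the model path and the default values are hypothetical. The sketch also assumes that the total context size is divided across slots, so it multiplies the per-slot context by the slot count. Recent builds may enable continuous batching by default, so check `llama-server --help` before relying on any of this.

```python
import shlex


def build_llama_server_cmd(model_path, slots=4, ctx_per_slot=4096, port=8080):
    """Return an argv list that starts llama-server with several request slots.

    Hypothetical helper: the flag names are llama-server options, the
    defaults are illustrative only.
    """
    if slots < 1 or ctx_per_slot < 1:
        raise ValueError("slots and ctx_per_slot must be positive")
    return [
        "llama-server",
        "-m", model_path,
        # Number of request slots that can be decoded concurrently.
        "--parallel", str(slots),
        # Assumption: the total context is shared across all slots,
        # so each slot gets roughly ctx_per_slot tokens.
        "--ctx-size", str(slots * ctx_per_slot),
        # Explicitly request continuous batching.
        "--cont-batching",
        "--port", str(port),
    ]


cmd = build_llama_server_cmd("models/model.gguf", slots=4, ctx_per_slot=2048)
print(shlex.join(cmd))
```

Continuous batching only helps when more than one slot is available, which is why the sketch sets --parallel alongside --cont-batching.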
Journey Context:
Without continuous batching, the server processes one request to completion before starting the next, leaving the GPU idle during input tokenization or network I/O. Continuous batching allows the server to: (1) start new requests while others are still generating, and (2) batch compatible requests (same model, sharing the server's KV cache space) into single forward passes. Tradeoff: higher peak VRAM usage (multiple KV caches active at once) and more complex slot management. Most users run separate instances or accept sequential latency instead. Continuous batching is essential for API servers handling more than one concurrent user on a single GPU.
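The scheduling difference described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how llama-server is implemented: each request is reduced to the number of tokens it still has to generate, prompt processing is ignored, and every forward pass is assumed to cost the same regardless of batch size (in practice larger batches are somewhat slower per pass). The function names are hypothetical.

```python
from collections import deque


def sequential_steps(lengths):
    """Forward passes needed when one request runs to completion at a time."""
    return sum(lengths)


def continuous_batching_steps(lengths, slots):
    """Forward passes needed when up to `slots` requests share each pass.

    A slot that finishes is refilled from the queue before the next pass,
    which is the core idea of continuous batching.
    """
    if slots < 1:
        raise ValueError("slots must be positive")
    waiting = deque(lengths)
    active = []  # tokens still to generate for each occupied slot
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One forward pass yields one token for every active request.
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


requests = [4, 2, 3, 1]  # tokens to generate per request (made-up numbers)
print(sequential_steps(requests))              # 10 passes
print(continuous_batching_steps(requests, 2))  # 5 passes
```

With one slot the two functions agree, which matches the sequential behavior in the report. The VRAM tradeoff is not modeled here: each extra slot needs its own share of KV cache.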
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:37:14.421057+00:00 - report_created - created