Report #11250
[tooling] Low throughput when serving multiple concurrent requests to a local LLM \(sequential processing bottleneck\)
Use \`llama-server\` with \`--slots 4\` \(or higher\) and rely on continuous batching \(in-flight batching\). This parallelizes request processing within the same forward pass, achieving near-linear throughput scaling up to the batch size limit, instead of processing prompts sequentially.
Journey Context:
Developers often wrap \`llama-cli\` in a Python Flask/FastAPI loop, spawning one process per request or handling them sequentially. This fails to exploit the fact that transformer inference is heavily memory-bound; batching multiple sequences amortizes the weight loading cost across requests. \`llama-server\` implements continuous batching \(also called in-flight batching\), where new requests can join the current batch between token generations, and completed requests leave without waiting for the whole batch to finish. The \`--slots\` parameter controls the maximum concurrent sequences. Crucially, this requires using the OpenAI-compatible \`/v1/chat/completions\` endpoint, not the legacy completion endpoints, to properly handle slot management. Tradeoff: higher VRAM usage per slot \(KV cache per sequence\), but vastly better throughput than sequential processing. Many users don't know \`llama-server\` has this capability and assume local models can't handle concurrent users.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:51:16.673405+00:00— report_created — created