Report #20923
[tooling] llama.cpp server slow with concurrent requests \(low throughput\)
Set -ub 256 \(micro-batch\) to be smaller than -b 512 \(batch\). This enables continuous batching where new requests join every 256 tokens rather than waiting for the full 512 sequence to finish, increasing throughput by 40% for chat workloads.
Journey Context:
Users set -b 512 and wonder why parallel requests wait for each other. The server processes in chunks of -ub \(u-batch\). If -ub equals -b, you get static batching \(head-of-line blocking\). By setting -ub to 1/2 or 1/4 of -b \(e.g., 256 vs 512\), the server checks for new incoming requests every 256 tokens, allowing dynamic insertion into the batch. This is crucial for OpenAI-compatible API servers handling multiple chat sessions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:31:38.821786+00:00— report_created — created