Report #63020
[tooling] llama-server throughput collapses with concurrent requests
Set the slot count to match the expected number of parallel users with `--parallel N` (`-np N`; note that `--slots` only toggles the slots monitoring endpoint and does not set the count), but crucially set `--batch-size >= N * average_context_length` and enable `--cont-batching` (continuous batching). This prevents head-of-line blocking, where a slow generation in slot 0 stalls slot 1: with continuous batching, the next token for every ready slot is computed in the same forward pass, regardless of whether the other slots have finished.
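A minimal sketch of the sizing rule above, written as a small Python helper that assembles the command line. The helper name `build_server_command`, the model path, and the example numbers are hypothetical. The slot count is passed as `--parallel`, which is the option name in the llama.cpp builds this sketch assumes; option names change between releases, so check `llama-server --help` before running anything.

```python
import shlex


def build_server_command(model_path, parallel_users, avg_context_tokens):
    """Assemble a llama-server argv list using the sizing rule from this report."""
    # Rule of thumb from the report: batch size >= slots * average context length.
    batch_size = parallel_users * avg_context_tokens
    return [
        "llama-server",
        "-m", model_path,                    # hypothetical model path
        "--parallel", str(parallel_users),   # one slot per expected concurrent user
        "--batch-size", str(batch_size),
        "--cont-batching",                   # may already be the default on newer builds
    ]


if __name__ == "__main__":
    cmd = build_server_command("models/model.gguf", parallel_users=4, avg_context_tokens=512)
    # Print a copy-pasteable shell command instead of launching the server.
    print(shlex.join(cmd))
```

The helper only prints the command, so the flags can be reviewed against the installed build before anything is launched.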
Journey Context:
With continuous batching off (the default on older builds; newer llama.cpp builds may already enable it), the server processes sequences with simple round-robin or blocking batching. In that mode, a 512-token generation in slot 0 forces slot 1 to wait 512 forward passes before receiving its first token, which users perceive as latency. The batch-size parameter controls how many tokens (across all slots) are processed in a single forward pass; if it is set too low relative to `slots * sequence_length`, the work is split across more passes, the GPU is underutilized, and throughput becomes bound by per-pass overhead on the CPU side. Common anti-pattern: running multiple llama-server instances behind nginx load balancing to handle concurrency. This duplicates the model weights in VRAM (N times), whereas proper slot configuration shares one copy of the weights, and only the KV cache scales with the number of slots.
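The two effects described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not a model of the actual llama.cpp scheduler: all requests arrive at the same moment, prompt processing is ignored, and the memory figures are hypothetical placeholders. The function names are made up for illustration.

```python
def first_token_pass(gen_lengths, continuous):
    """Return, per slot, the forward pass in which that slot emits its first token."""
    if continuous:
        # Every active slot advances by one token in each shared forward pass.
        return [1 for _ in gen_lengths]
    # Blocking: a slot starts only after all earlier slots have finished generating.
    passes, elapsed = [], 0
    for length in gen_lengths:
        passes.append(elapsed + 1)
        elapsed += length
    return passes


def vram_gib(weights_gib, kv_per_slot_gib, n, shared_weights):
    """Estimate VRAM for n concurrent users, with or without shared weights."""
    weight_copies = 1 if shared_weights else n
    return weight_copies * weights_gib + n * kv_per_slot_gib


if __name__ == "__main__":
    # Slot 0 generates 512 tokens, slot 1 generates 64.
    print(first_token_pass([512, 64], continuous=False))  # [1, 513]
    print(first_token_pass([512, 64], continuous=True))   # [1, 1]
    # Hypothetical sizes: 8 GiB of weights, 0.5 GiB of KV cache per user.
    print(vram_gib(8.0, 0.5, 4, shared_weights=False))    # 34.0 (four instances)
    print(vram_gib(8.0, 0.5, 4, shared_weights=True))     # 10.0 (one instance, four slots)
```

In the blocking case slot 1 waits the full 512 passes, matching the figure in the text, and the four-instance setup pays for the weights four times while the single-instance setup pays only for extra KV cache.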
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:15:32.913232+00:00— report_created — created