Report #93936
[tooling] llama-server API serializes concurrent requests instead of processing them in parallel, causing high latency under load
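Start llama-server with `--parallel N` (short form `-np N`, e.g., `--parallel 4`) to serve N requests concurrently with continuous batching. The server reserves KV cache space for N parallel sequences that share a single copy of the model weights, so aggregate throughput rises with the number of active requests until compute or VRAM becomes the bottleneck. Note that `--slots` is a different option in current llama.cpp builds (it toggles the slot monitoring endpoint) and does not set the number of slots.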
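The launch command can be sketched as a small helper that builds the argument list. This is a minimal sketch under stated assumptions: the helper name `build_server_args` and the model path are hypothetical, the slot count is assumed to be set with `--parallel`, and `--ctx-size` is assumed to be the total context shared by all slots, so it is multiplied by the slot count.

```python
from typing import List


def build_server_args(model_path: str, n_slots: int, ctx_per_slot: int) -> List[str]:
    """Return an argv list for llama-server with n_slots parallel slots."""
    if n_slots < 1 or ctx_per_slot < 1:
        raise ValueError("n_slots and ctx_per_slot must be positive")
    # Assumption: --ctx-size is the total shared by all slots, so scale it up
    # to keep ctx_per_slot tokens available to each sequence.
    total_ctx = n_slots * ctx_per_slot
    return [
        "llama-server",
        "--model", model_path,
        "--parallel", str(n_slots),
        "--ctx-size", str(total_ctx),
    ]


if __name__ == "__main__":
    # Hypothetical model path, for illustration only.
    print(" ".join(build_server_args("models/model.gguf", 4, 4096)))
```

Check the output of `llama-server --help` for your build before relying on these flags, since defaults and option names have changed between releases.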
Journey Context:
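With a single slot (`--parallel 1`, the long-standing default), llama-server processes one sequence at a time, so requests queue up even if 10 clients connect simultaneously. The `--parallel N` option creates N slots, and with continuous batching (`-cb` / `--cont-batching`, enabled by default in recent builds) the server batches tokens from all active sequences into each forward pass. The memory-bandwidth cost of reading the weights is then shared across all active requests. Critical detail: the KV cache for one sequence of `ctx` tokens takes `2 * n_layers * n_kv_heads * head_dim * ctx * sizeof(dtype)` bytes of VRAM, where `n_kv_heads` is smaller than the attention head count for models that use grouped-query attention, and `dtype` is the KV cache type, not the weight quantization. For a 70B Q4 model with a Llama-2-70B-style layout (80 layers, 8 KV heads, head dimension 128) and an f16 cache, 4k tokens of context is about 1.25 GiB per slot, so four such slots need about 5 GiB in total. In builds that divide `--ctx-size` evenly among slots, each slot gets only `ctx_size / N` tokens, so users who forget to raise `--ctx-size` along with the slot count see context truncation. The alternative is to run multiple separate instances, but that duplicates the weights in VRAM, which rules out multi-user serving on limited hardware.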
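The per-slot KV cache estimate can be checked with a few lines of Python. This is a minimal sketch: the function name and the example dimensions (80 layers, 8 KV heads, head dimension 128, 2-byte f16 cache entries) are assumptions chosen to resemble a Llama-2-70B-style model, and the KV head count is used rather than the attention head count because grouped-query attention models cache only the KV heads.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size in bytes for one sequence of ctx_tokens tokens."""
    # Factor 2 covers the separate key and value tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


GIB = 1024 ** 3

if __name__ == "__main__":
    # Assumed Llama-2-70B-style dimensions with an f16 KV cache.
    per_slot = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, ctx_tokens=4096)
    n_slots = 4
    print(f"per slot: {per_slot / GIB:.2f} GiB")            # per slot: 1.25 GiB
    print(f"{n_slots} slots: {n_slots * per_slot / GIB:.2f} GiB")  # 4 slots: 5.00 GiB
```

Actual usage also depends on the KV cache type and on compute buffers, so treat the result as a lower bound when sizing VRAM.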
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:15:32.370445+00:00— report_created — created