Report #3862
[tooling] llama-server cannot handle concurrent requests efficiently, queuing them serially
Launch llama-server with -np 4 (long form: --parallel 4; the two are the same flag, so pass only one) or a higher value to create multiple slots. With continuous batching, requests in different slots are decoded together, so concurrent requests share the same model weights instead of waiting in a serial queue.
Journey Context:
Users migrating from simple Python scripts to llama-server often expect multiple HTTP requests to be batched automatically. With a single slot (-np 1), llama-server instead queues each subsequent request until the first one finishes. The -np (or --parallel) flag creates multiple slots that share the same model weights but keep separate KV cache state, which lets requests run in parallel. Critical detail: KV cache memory grows linearly with the total number of cached tokens, roughly slot_count * per_slot_context * bytes_per_token. In builds where -c (context length) sets the total context and divides it evenly among slots, each slot gets only c / np tokens, so raising -np without raising -c shrinks the per-slot context, and raising -c to compensate increases VRAM use in proportion. To fit multiple slots in VRAM, either accept a shorter per-slot context or use KV cache quantization (--cache-type-k / --cache-type-v). Many tutorials miss this interaction between -np and memory.
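The memory arithmetic can be sketched in a few lines of Python. This is a rough estimate, not llama-server's own accounting. It assumes that -c is the total context split evenly across slots, that the cache stores one K and one V vector per layer per token, and that the bytes-per-element figures follow the usual ggml block layouts (34 bytes per 32 elements for q8_0, 18 bytes per 32 for q4_0). The function name and the model shape in the usage lines are illustrative, not taken from the report.

```python
from fractions import Fraction

# Approximate bytes per cached element for a few cache types.
# Assumption: ggml block layouts (q8_0 = 34 bytes / 32 elems, q4_0 = 18 bytes / 32 elems).
BYTES_PER_ELEM = {
    "f16": Fraction(2),
    "q8_0": Fraction(34, 32),
    "q4_0": Fraction(18, 32),
}


def kv_cache_estimate(n_layers, n_kv_heads, head_dim,
                      per_slot_ctx, n_slots, cache_type="f16"):
    """Return (total_ctx, kv_bytes) for a hypothetical slot configuration.

    total_ctx is the value one would pass as -c, assuming the server
    divides the total context evenly among n_slots.
    """
    total_ctx = per_slot_ctx * n_slots
    # Factor 2: one K vector and one V vector per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    return total_ctx, int(per_token * total_ctx)


if __name__ == "__main__":
    # Illustrative shape: 32 layers, 8 KV heads, head dimension 128.
    for cache_type in ("f16", "q8_0", "q4_0"):
        ctx, size = kv_cache_estimate(32, 8, 128, per_slot_ctx=4096,
                                      n_slots=4, cache_type=cache_type)
        print(f"-np 4 -c {ctx} ({cache_type}): {size / 2**30:.2f} GiB of KV cache")
```

For the illustrative shape, four slots of 4096 tokens each need -c 16384 and about 2 GiB of f16 KV cache, four times the single-slot figure. Actual usage depends on the model and build, so treat the result as a starting point for choosing -np and -c rather than a guarantee.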
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:21:05.476286+00:00— report_created — created