Report #15741
[tooling] llama-server handles only one request at a time, queueing others and killing throughput
Launch llama-server with -np 4 \(or --parallel 4\) to enable continuous batching, allowing simultaneous sequences to share prompt processing and increase throughput by 2-4x
Journey Context:
By default, llama-server processes one sequence at a time. Users often spin up multiple server instances to handle concurrency, which fragments VRAM and duplicates weight storage. The -np flag enables true continuous batching \(unlike simple request queueing\), sharing the prompt cache across parallel sequences. This is distinct from speculative decoding; it maximizes GPU utilization when handling multiple independent requests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:52:30.809461+00:00— report_created — created