Report #58600
[tooling] llama.cpp server has low throughput handling multiple concurrent requests
Start the llama.cpp server with -np 4 (or higher) to allocate parallel slots for continuous batching, and send requests to the /completion endpoint with stream=true. The server then processes up to -np sequences (4 here) in parallel within the same batch, which substantially increases throughput compared with handling requests one after another.
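A minimal client sketch for exercising the parallel slots, assuming the server listens on 127.0.0.1:8080 and streams /completion output as "data: {...}" lines whose JSON carries a "content" field. The URL, the helper names (build_payload, parse_sse_line, complete, run_concurrently) and the example prompts are illustrative assumptions, not part of the original report.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed address; adjust to wherever the server was actually started.
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_payload(prompt: str, n_predict: int = 64) -> bytes:
    """Encode a /completion request body with streaming enabled."""
    body = {"prompt": prompt, "n_predict": n_predict, "stream": True}
    return json.dumps(body).encode("utf-8")


def parse_sse_line(line: bytes) -> str:
    """Return the text chunk carried by one 'data: {...}' line, else ''."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data: "):
        return ""
    payload = text[len("data: "):]
    if not payload.startswith("{"):  # skip any non-JSON terminator line
        return ""
    return json.loads(payload).get("content", "")


def complete(prompt: str) -> str:
    """Send one streaming request and concatenate the returned chunks."""
    req = urllib.request.Request(
        SERVER_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return "".join(parse_sse_line(line) for line in resp)


def run_concurrently(prompts: list[str], workers: int = 4) -> list[str]:
    """Issue the prompts in parallel; workers should match the server's -np."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(complete, prompts))


if __name__ == "__main__":
    for out in run_concurrently([f"Count to {i}:" for i in range(3, 7)]):
        print(out)
```

Keeping the worker count equal to -np means every slot can be busy at once; extra requests beyond that would simply wait for a free slot.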
Journey Context:
Without -np, llama.cpp processes one request at a time, leaving the GPU underutilized during memory-bound phases. The -np flag sets the number of parallel slots, which lets the server's continuous batching (also called in-flight batching or parallel sequences) run several sequences at once, with the KV cache split across the slots. Each slot handles one request; when one finishes, the next waiting request fills the slot immediately. Common mistakes: setting -np without raising the context length (-c) enough to accommodate all parallel sequences (total tokens = np * avg_tokens, so each slot only gets about -c / np tokens), and using the single-shot CLI (main.exe) instead of server mode. Parallel slots are essential for API-like local deployments.
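The sizing rule above (total tokens = np * tokens per sequence) can be checked with a few lines of arithmetic. This is a sketch that assumes the server divides -c evenly across slots; the function names and the example figures are hypothetical.

```python
def required_context(n_parallel: int, tokens_per_request: int) -> int:
    """Total -c needed so every slot can hold one full request."""
    return n_parallel * tokens_per_request


def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Tokens available to each slot, assuming an even split of -c."""
    return total_ctx // n_parallel


if __name__ == "__main__":
    n_parallel = 4             # value passed to -np
    tokens_per_request = 2048  # prompt plus generated tokens, worst case
    total = required_context(n_parallel, tokens_per_request)
    print(f"suggested flags: -np {n_parallel} -c {total}")
    print(f"per-slot budget: {per_slot_context(total, n_parallel)} tokens")
```

Sizing by the worst-case request rather than the average is the conservative choice here: a slot that runs out of context would cut a request short even when the other slots have room to spare.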
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:51:03.471909+00:00— report_created — created