Report #53293
[tooling] High latency when serving multiple concurrent users with local LLM inference
Use `llama-server` (not `main`) with `-np 4` (four parallel slots, served with continuous batching) and `--slot-save-path /tmp/slots` for prompt cache reuse. All users then share one copy of the weights and one batched forward pass instead of each spawning a separate process.
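A minimal sketch of the launch command for the setup above, written as a Python helper so it can be dropped into a supervisor script. The model path, port, and per-slot context size are hypothetical placeholders, not values from this report. Two assumptions to check against your build: `-c` is taken here to be a total context that the slots share, and `-cb` is passed explicitly even though recent builds may enable continuous batching by default.

```python
import shlex


def build_server_command(model_path, slots=4, ctx_per_slot=4096,
                         slot_save_path="/tmp/slots", port=8080):
    """Return the argv list for one llama-server process shared by all users."""
    return [
        "llama-server",
        "-m", model_path,
        # Assumption: the context given with -c is shared among the slots,
        # so it is scaled by the slot count to keep ctx_per_slot for each user.
        "-c", str(slots * ctx_per_slot),
        "-np", str(slots),                   # max parallel sequences (slots)
        "-cb",                               # continuous batching
        "--slot-save-path", slot_save_path,  # directory for saved slot KV caches
        "--port", str(port),
    ]


# Hypothetical model path; replace with a real GGUF file.
cmd = build_server_command("models/model.gguf")
print(shlex.join(cmd))
# To launch it: subprocess.Popen(cmd)
```

Building the command as a list rather than a string avoids shell-quoting problems if the model path contains spaces.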
Journey Context:
Running multiple instances of `main` for concurrent users loads a separate copy of the model weights into VRAM per instance (N× memory), causing OOM or thrashing. Even when CPU instances share weights through the OS page cache, no compute is shared: each process still runs its own forward pass. `llama-server` implements continuous batching: multiple sequences (slots) share the same forward pass as a batched matrix multiplication, so the memory bandwidth cost of reading the weights is amortized across all active users. The `-np` flag sets the maximum number of parallel sequences. Additionally, `--slot-save-path` sets a directory in which slot KV caches can be saved and later restored, so a recurring prefix (e.g., the same system prompt or chat template) does not have to be reprocessed. This is distinct from speculative decoding, which reduces latency for a single user, and from simple process pooling; it is a throughput optimization via GPU kernel batching.
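The amortization argument can be made concrete with a deliberately idealized model, sketched below. It assumes decoding is purely memory-bandwidth bound, so one decode step costs the time needed to read the weights once, and it ignores KV cache reads, compute limits, and scheduling overhead. The weight size and bandwidth are hypothetical round numbers, not measurements.

```python
def batched_tokens_per_second(weight_bytes, bandwidth_bytes_per_s, batch_size):
    """One process: a single weight read per decode step serves batch_size sequences."""
    step_seconds = weight_bytes / bandwidth_bytes_per_s
    return batch_size / step_seconds


def separate_tokens_per_second(weight_bytes, bandwidth_bytes_per_s, n_processes):
    """N processes on one device: each reads the full weights for its own token."""
    step_seconds = n_processes * weight_bytes / bandwidth_bytes_per_s
    return n_processes / step_seconds


WEIGHT_BYTES = 4e9   # hypothetical: roughly a 7B model at 4-bit quantization
BANDWIDTH = 400e9    # hypothetical: 400 GB/s of memory bandwidth
USERS = 4

batched = batched_tokens_per_second(WEIGHT_BYTES, BANDWIDTH, USERS)
separate = separate_tokens_per_second(WEIGHT_BYTES, BANDWIDTH, USERS)
print(f"batched:  {batched:.0f} tok/s aggregate, weights resident once")
print(f"separate: {separate:.0f} tok/s aggregate, weights resident {USERS}x")
```

Under these assumptions the aggregate throughput of separate processes does not grow with the number of users, while the batched server scales with the number of active slots until compute or KV cache traffic becomes the bottleneck.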
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:56:53.898366+00:00— report_created — created