Report #69789
[tooling] Running multiple llama.cpp instances for concurrent users causes VRAM exhaustion
Run a single llama.cpp server process with `-np N` (or `--parallel N`) to enable N parallel slots. All users then share one loaded copy of the model: either pin each request to a specific slot with `id_slot` (named `slot_id` in older builds) or let the server auto-assign an idle slot.
Journey Context:
The common anti-pattern is spawning a separate OS process (e.g., one Docker container) per concurrent user to achieve parallelism. Each process loads its own copy of the model weights into VRAM, so memory use grows with the number of users and quickly ends in out-of-memory (OOM) errors. llama.cpp's server mode instead supports internal "slots": independent sequence streams that are decoded together in a single batch. Setting `-np 4` enables 4 parallel slots. The KV cache is partitioned among these slots, so requests are processed concurrently (not just queued) while only one copy of the model weights stays in memory. The tradeoff is that each slot receives only a fraction of the total KV cache capacity, which limits per-user context length, but this is far more memory-efficient than running multiple processes. To keep a stateful multi-turn conversation on the same slot, set `id_slot` (`slot_id` in older builds) in the JSON payload.
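The slot setup described above can be sketched in Python. This is a minimal illustration, not a verified recipe. The helper names (`per_slot_context`, `server_args`, `completion_payload`, `post_completion`), the `model.gguf` path, and the base URL are hypothetical. It assumes the binary is called `llama-server`, that the total context set with `-c` is split evenly across slots, and that the `/completion` endpoint accepts `id_slot`; check all three against your build.

```python
import json
import urllib.request


def per_slot_context(n_ctx, n_parallel):
    """Context tokens per slot, ASSUMING the total (-c) is split evenly."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return n_ctx // n_parallel


def server_args(model_path, n_ctx, n_parallel):
    """Argument list for ONE server process that serves all users.

    The binary name is an assumption; older builds ship it as `server`.
    """
    return ["llama-server", "-m", model_path,
            "-c", str(n_ctx), "-np", str(n_parallel)]


def completion_payload(prompt, slot=-1, slot_key="id_slot"):
    """JSON body for POST /completion.

    slot=-1 lets the server pick an idle slot; a fixed slot number keeps a
    multi-turn conversation on the same slot. Pass slot_key="slot_id" for
    older builds that still use the previous field name.
    """
    return {
        "prompt": prompt,
        "n_predict": 128,      # illustrative generation limit
        "cache_prompt": True,  # reuse the slot's cached prefix between turns
        slot_key: slot,
    }


def post_completion(base_url, payload):
    """Send the payload to a running server (base_url is a placeholder)."""
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `server_args("model.gguf", 8192, 4)` describes a single process with four slots, and under the even-split assumption `per_slot_context(8192, 4)` gives each user a 2048-token window. Raising `-c` is the lever for longer per-user context, at the cost of a larger KV cache in VRAM.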
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:37:44.041002+00:00— report_created — created