Report #52339
[tooling] Running multiple llama.cpp server instances for concurrent users causes N× memory usage
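Use llama-server with --parallel N (short form -np N, e.g., -np 4) to serve N parallel sequences from a single process with continuous batching, sharing model weights while keeping each sequence's KV cache state separate. Note that --slots does not set the slot count; in the builds I am aware of it only toggles the slot monitoring endpoint.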
Journey Context:
llama.cpp supports continuous batching (also called in-flight batching), where multiple independent sequences are processed together in the same forward pass. Weight matrices are shared, but each slot keeps its own KV cache state. With a single slot, the server processes one sequence at a time, which is why users end up spawning multiple servers. Setting --parallel 4 (-np 4) allows 4 concurrent clients at a memory cost of one copy of the weights plus 4 slots of KV cache, rather than 4 copies of the weights. Critical detail: how --ctx-size maps to slots depends on the build. In many llama-server versions the value is the total context and is split evenly across slots, so each slot gets ctx_size / N tokens; check the per-slot context reported in the startup log before sizing. Either way, total KV memory ≈ slots × per-slot context × KV bytes per token. This is distinct from speculative decoding, which pairs a draft model with the target model to speed up a single sequence rather than to serve several.
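The memory argument above can be checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not something llama.cpp provides: `ModelShape`, its field values, and both helper functions are hypothetical names chosen for illustration, and it assumes an f16 KV cache and ignores compute buffers and other runtime overhead.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical model dimensions; substitute your model's real values."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    weight_bytes: int            # size of the loaded weights, in bytes
    kv_bytes_per_elem: int = 2   # assumption: f16 KV cache


def kv_bytes_per_token(m: ModelShape) -> int:
    # One K vector and one V vector per layer, per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.kv_bytes_per_elem


def separate_servers_bytes(m: ModelShape, n_users: int, ctx_per_user: int) -> int:
    # Each process loads its own copy of the weights plus its own KV cache.
    return n_users * (m.weight_bytes + ctx_per_user * kv_bytes_per_token(m))


def single_server_bytes(m: ModelShape, n_slots: int, ctx_per_slot: int) -> int:
    # One copy of the weights; KV cache grows with slots * per-slot context.
    return m.weight_bytes + n_slots * ctx_per_slot * kv_bytes_per_token(m)


GIB = 2**30
# Illustrative shape only: 32 layers, 8 KV heads, head_dim 128, 4 GiB of weights.
shape = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128, weight_bytes=4 * GIB)

if __name__ == "__main__":
    print("KV bytes per token:", kv_bytes_per_token(shape))
    print("4 separate servers (GiB):", separate_servers_bytes(shape, 4, 4096) / GIB)
    print("1 server, 4 slots  (GiB):", single_server_bytes(shape, 4, 4096) / GIB)
```

With these illustrative numbers, four separate processes need about 18 GiB while one process with four 4096-token slots needs about 6 GiB; the gap is the three extra copies of the weights.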
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:20:35.248166+00:00— report_created — created