Report #16568
[tooling] llama-server with --parallel mixes conversations between users or loses session state
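When using `--parallel N`, clients must record the slot ID returned in the first response and send it back as `slot_id` (documented as `id_slot` in current llama.cpp builds) in every subsequent request to maintain session continuity. If the field is omitted, the server picks an available slot itself, so a user's 10th message can be routed to a slot that holds another user's conversation context. Set `--parallel` to the maximum number of concurrent users, and size `--ctx-size` so that each slot gets enough context. In llama.cpp builds that split the KV cache across slots, `--ctx-size` is the total and each slot receives ctx-size / parallel, so the value to pass is per-user context \* parallel. Confirm the per-slot context size in the server's startup log.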
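As a rough illustration of the workaround, the sketch below pins each user to one slot and attaches that slot index to every request. It is a minimal sketch under stated assumptions, not llama-server's own client logic: `SlotRegistry`, `build_payload`, and `send` are hypothetical helper names, the slot is chosen client-side instead of being read from the first response, and the request field is assumed to be `id_slot` on the `/completion` endpoint (older builds, and this report, call it `slot_id`).

```python
import json
import urllib.request

# Assumption: current llama.cpp documents the request field as "id_slot";
# older builds (and this report) call it "slot_id". Adjust for your build.
SLOT_FIELD = "id_slot"


class SlotRegistry:
    """Hypothetical helper: pins each user to one slot index in [0, n_slots)."""

    def __init__(self, n_slots):
        self.free = list(range(n_slots))  # slot indices nobody has claimed yet
        self.owner = {}                   # user id -> pinned slot index

    def acquire(self, user):
        """Return the user's pinned slot, claiming a free one on first use."""
        if user not in self.owner:
            if not self.free:
                raise RuntimeError("all slots are pinned; raise --parallel or queue the user")
            self.owner[user] = self.free.pop(0)
        return self.owner[user]

    def release(self, user):
        """Give the user's slot back, e.g. when the chat session ends."""
        slot = self.owner.pop(user, None)
        if slot is not None:
            self.free.append(slot)


def build_payload(registry, user, prompt, n_predict=128):
    """Build a /completion request body that always names the user's slot."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "cache_prompt": True,  # let the server reuse the slot's cached prefix
        SLOT_FIELD: registry.acquire(user),
    }


def send(base_url, payload):
    """POST the payload to llama-server and return the decoded JSON reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server started with `--parallel 4`, a backend would create `SlotRegistry(4)` once and call `send(base_url, build_payload(registry, user_id, prompt))` per message, where `base_url` is wherever the server listens (for example `http://localhost:8080`, an assumption about the deployment). Releasing the slot when a session ends keeps idle users from holding slots indefinitely.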
Journey Context:
The `--parallel N` flag creates N independent KV cache slots that the server decodes together (continuous batching). Users tend to read this as "N independent model instances" and expect automatic session stickiness. However, the server keeps no record of which HTTP connection owns which slot. If a client sends a completion request without specifying `slot_id` (documented as `id_slot` in current llama.cpp builds), the server chooses an available slot itself; the exact selection policy depends on the build. Suppose User A used Slot 1 for messages 1-5. If User B then arrives and the server hands Slot 1 to User B's first message, because nothing in User A's requests reserved it, User B's response can end up conditioned on User A's conversation history, and User A's cached state is lost. Naming the slot in the JSON payload is what makes sessions sticky.

Context is also budgeted per slot, not shared on demand. In llama.cpp builds that split the KV cache across slots, `--ctx-size` is the total: with `--ctx-size 4096` and `--parallel 4`, each slot gets 1,024 tokens, not 4k each and not a 4k pool that one busy user can take over. To give each of 4 users a 4k window, pass `--ctx-size 16384`, and confirm the per-slot context size in the server's startup log. If a conversation exceeds its slot's window, the slot's cache is truncated or the request errors, depending on the truncation settings. Both points are critical when building multi-user chat backends on llama-server.
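The context budgeting can be checked with a few lines of arithmetic. This is a sketch under an explicit assumption: with `split_across_slots=True` it models builds where `--ctx-size` is a total divided evenly among slots, and with `False` it models a build where every slot sees the full `--ctx-size`. Which case applies should be confirmed from the per-slot context size in the server's startup log. The function names are hypothetical.

```python
def per_slot_ctx(ctx_size, parallel, split_across_slots=True):
    """Tokens of context each slot gets for a given --ctx-size and --parallel."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    # Assumption: an even split, with any remainder discarded.
    return ctx_size // parallel if split_across_slots else ctx_size


def required_ctx_size(per_user_ctx, parallel, split_across_slots=True):
    """Value to pass as --ctx-size so every slot gets per_user_ctx tokens."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return per_user_ctx * parallel if split_across_slots else per_user_ctx


def fits(n_prompt_tokens, n_predict, slot_ctx):
    """True if the prompt plus the requested generation fits in one slot."""
    return n_prompt_tokens + n_predict <= slot_ctx


slot_ctx = per_slot_ctx(4096, 4)       # 1024 tokens per slot when split
needed = required_ctx_size(4096, 4)    # 16384 total for 4k per user when split
```

A backend can call `fits` before sending a request and trim or summarise old turns when it returns `False`, instead of relying on the server's truncation behaviour.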
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:56:13.720309+00:00— report_created — created