Report #56036
[tooling] llama.cpp server mixing contexts or failing with concurrent requests from multiple clients
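Explicitly set the slot count with --parallel N (-np N), where N is the number of isolated KV caches your VRAM can hold, and pass --slot-save-path /tmp/slots so slot state can be written to disk. Then pin each client session to its own slot by sending a unique slot ID (the id_slot request field) with every request. Each client then gets an isolated KV-cache slot instead of sharing context or triggering race conditions. Note that in current llama-server builds --slots only toggles the slot monitoring endpoint and takes no count; the number of slots is controlled by --parallel.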
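A minimal Python sketch of the setup described above, under stated assumptions: the slot count is set with llama-server's --parallel option, the total --ctx-size is split evenly across slots, and the completion request accepts id_slot and cache_prompt fields (the slot field may be named differently in older builds). SlotRegistry, server_command, and completion_payload are hypothetical helper names used only for illustration; they are not part of llama.cpp.

```python
from dataclasses import dataclass, field


def server_command(model_path: str, n_slots: int, ctx_per_slot: int, save_dir: str) -> list[str]:
    """Build an argv list that starts llama-server with a fixed number of slots."""
    return [
        "llama-server",
        "-m", model_path,
        "--parallel", str(n_slots),                 # number of slots
        # Assumption: total context is divided evenly among the slots.
        "--ctx-size", str(n_slots * ctx_per_slot),
        "--slot-save-path", save_dir,               # directory for saved slot state
    ]


@dataclass
class SlotRegistry:
    """Hypothetical client-side map from session ID to a dedicated slot ID."""
    n_slots: int
    _by_session: dict[str, int] = field(default_factory=dict)

    def slot_for(self, session_id: str) -> int:
        # Reuse the slot already pinned to this session, if any.
        if session_id in self._by_session:
            return self._by_session[session_id]
        used = set(self._by_session.values())
        for slot in range(self.n_slots):
            if slot not in used:
                self._by_session[session_id] = slot
                return slot
        raise RuntimeError("no free slot; release a session or raise --parallel")

    def release(self, session_id: str) -> None:
        # Free the slot so another session can claim it.
        self._by_session.pop(session_id, None)


def completion_payload(registry: SlotRegistry, session_id: str, prompt: str) -> dict:
    """Build a completion request body pinned to the session's slot."""
    return {
        "prompt": prompt,
        "id_slot": registry.slot_for(session_id),  # assumed request field name
        "cache_prompt": True,                      # assumed request field name
    }
```

The registry lives on the client or gateway side, so session affinity holds even when several front-end workers talk to the same server, provided they share one registry.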
Journey Context:
When requests do not name a slot, the llama.cpp server assigns them to slots dynamically, which under high concurrency can conflate requests and cause prompt leakage or context corruption. By pre-allocating slots (--parallel) and persisting slot state (--slot-save-path), you treat the server as a stateful API with session affinity. This is critical for multi-user deployments. Common mistake: assuming --ctx-size alone manages concurrency. The context size is shared across slots, so it sets how much context each slot gets, not how many clients can run at once; without an explicit slot count, the server either serializes requests or corrupts their context. The slot save/load mechanism also lets you pause and resume long conversations without holding their KV cache in memory the whole time.
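The pause/resume idea can be sketched as follows. This assumes the server exposes a slot endpoint of the form POST /slots/{id}?action=save (or restore) that takes a JSON body with a filename relative to the --slot-save-path directory; check the README of your build before relying on it. per_slot_context and slot_action_request are hypothetical helper names, and the even split of context across slots is likewise an assumption.

```python
import json
from urllib import request


def per_slot_context(ctx_size: int, n_slots: int) -> int:
    """Context available to one slot, assuming an even split across slots."""
    return ctx_size // n_slots


def slot_action_request(base_url: str, slot_id: int, action: str, filename: str) -> request.Request:
    """Build (but do not send) a request that saves or restores one slot."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    # Assumed endpoint shape: /slots/<id>?action=<save|restore>
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen. Saving a session's slot before releasing it, then restoring it into any free slot later, is what lets an idle conversation leave memory without losing its cached context.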
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:33:05.942260+00:00— report_created — created