Report #22203
[tooling] llama.cpp server: running one instance per user causes OOM; sequential processing in a single instance causes latency spikes
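Run a single llama-server with --parallel 4 (short form -np 4), which creates four slots. --slots takes no number; it only toggles the /slots monitoring endpoint. This keeps one copy of the model weights in memory with a separate KV cache per slot, so 4 conversations can run concurrently without reloading the model. In builds that divide the -c context size evenly across slots, set -c to the per-slot context times 4.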
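A minimal sketch of the launch command, assuming --parallel sets the slot count and --slots is a valueless switch that exposes the /slots endpoint, as in current llama.cpp builds. The model path, the per-slot context size, and the helper name build_server_cmd are hypothetical; the even split of -c across slots is an assumption that depends on the build, so check llama-server --help before relying on it.

```python
import shlex


def build_server_cmd(model_path, n_slots=4, ctx_per_slot=4096):
    """Build the argv list for one llama-server process with n_slots slots."""
    if n_slots < 1:
        raise ValueError("n_slots must be >= 1")
    return [
        "llama-server",
        "-m", model_path,
        # Number of slots, i.e. conversations decoded concurrently.
        "--parallel", str(n_slots),
        # Assumption: the total context is divided evenly across slots,
        # so request n_slots times the context each conversation needs.
        "-c", str(n_slots * ctx_per_slot),
        # Valueless switch: expose the /slots monitoring endpoint.
        "--slots",
    ]


# Hypothetical model path, for illustration only.
cmd = build_server_cmd("models/model.gguf")
print(shlex.join(cmd))
```

The list can be passed to subprocess.Popen as-is; printing it with shlex.join gives a line that can be pasted into a shell.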
Journey Context:
Running separate llama-server instances per user duplicates the model weights in VRAM (a 70B model times N users does not fit). A single instance that processes requests sequentially ruins latency for user 2 while user 1 generates. Slots are llama.cpp's solution: shared weights, separate KV cache state. Each slot keeps its own context history. Slot state can be saved and restored via the API for persistent chats across restarts; this requires starting the server with --slot-save-path. Critical: --parallel sets the number of slots, and therefore the number of concurrent contexts; --slots does not take a count and only controls whether the /slots endpoint is exposed.
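The save/restore step can be sketched as below, assuming the server was started with --slot-save-path and exposes POST /slots/{id}?action=save|restore|erase with a JSON body carrying a filename, as described in the llama.cpp server README. The base URL, the file name user-a.bin, and the helper names are hypothetical, and the endpoint shape should be checked against the build in use.

```python
import json
import urllib.request

# Assumption: llama-server listening on its default host and port.
BASE_URL = "http://127.0.0.1:8080"


def slot_action_request(slot_id, action, filename=None, base_url=BASE_URL):
    """Build (but do not send) a POST request that saves, restores, or erases a slot."""
    if action not in ("save", "restore", "erase"):
        raise ValueError(f"unsupported slot action: {action}")
    # save/restore name a file relative to --slot-save-path; erase needs no file.
    body = {} if filename is None else {"filename": filename}
    return urllib.request.Request(
        f"{base_url}/slots/{slot_id}?action={action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req):
    """Send a prepared request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A typical flow would be send(slot_action_request(0, "save", "user-a.bin")) before shutting the server down, then send(slot_action_request(0, "restore", "user-a.bin")) after restart, so that the user's KV cache does not have to be rebuilt from the full prompt.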
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:40:56.240527+00:00— report_created — created