Report #13303
[tooling] llama-server slow multi-turn chat due to KV cache recomputation on every request
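Start the server with 8 slots (`--parallel 8` or `-np 8` in current llama.cpp builds; there, `--slots` only toggles the slot-monitoring endpoint) and send `cache_prompt: true` plus a fixed slot index (0-7) in each JSON payload. The slot field is `slot_id` in older builds and `id_slot` in newer ones. Subsequent turns then reuse the cached KV tensors for the shared prompt prefix instead of recomputing them.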
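A minimal client-side sketch of the workaround, assuming the server exposes the `/completion` endpoint on a local port. The helper names `build_payload` and `complete`, the base URL, and the `n_predict` value are hypothetical; the slot field name is a parameter because it has differed between llama.cpp versions, so check the server README for your build.

```python
import json
import urllib.request


def build_payload(prompt, slot, slot_field="id_slot", n_predict=128):
    """Build a /completion body pinned to one slot with prompt caching on.

    slot_field is an assumption: pass "slot_id" for older server builds.
    """
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "cache_prompt": True,  # ask the server to reuse the cached prefix
        slot_field: slot,      # keep this conversation on the same slot
    }


def complete(base_url, payload, timeout=60):
    """POST the payload to the server and return the decoded JSON reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical two-turn chat: each turn resends the whole transcript so
    # the server can match it against the prefix cached in slot 0.
    base_url = "http://127.0.0.1:8080"  # assumed address of llama-server
    transcript = "User: Hello\nAssistant:"
    reply = complete(base_url, build_payload(transcript, slot=0))
    transcript += reply.get("content", "") + "\nUser: And again?\nAssistant:"
    reply = complete(base_url, build_payload(transcript, slot=0))
    print(reply.get("content", ""))
```

The client still sends the full transcript on every turn; the saving comes from the server skipping evaluation of the part that matches what the slot already holds.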
Journey Context:
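Most users treat llama-server as stateless HTTP, so the full prompt is re-evaluated on every request and the total prefill work over a conversation grows roughly quadratically with its length. Each slot holds its own KV cache. Without a pinned slot, the server may assign the request to any idle slot, which can miss or overwrite the cached prefix. When a conversation is pinned to a slot and `cache_prompt` is set, the server keeps that conversation's KV cache in the slot and re-evaluates only the tokens after the longest matching prefix. Tradeoff: KV memory grows as slots × context × layers, so cap slots to the number of actual concurrent users. Alternative: persist slot state to disk with `--slot-save-path` and the slot save/restore endpoints (slower). Note that `--mlock` does not offload to disk; it keeps the model resident in RAM to prevent swapping.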
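To put rough numbers on the memory tradeoff, here is a back-of-the-envelope estimate. It assumes an f16 KV cache (2 bytes per element) and uses the standard accounting of one K and one V tensor per layer per cached token. The function name `kv_cache_bytes` and the model dimensions are illustrative assumptions, not values from this report, and real usage also depends on cache quantization and how the server divides context between slots.

```python
def kv_cache_bytes(n_slots, ctx_per_slot, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Counts K and V (factor 2) for every layer and every cached token,
    summed over all slots. bytes_per_elem=2 assumes an f16 cache.
    """
    tokens = n_slots * ctx_per_slot
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
    total = kv_cache_bytes(n_slots=8, ctx_per_slot=4096,
                           n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"{total / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

Under these assumptions, halving the slot count halves the estimate, which is why matching slots to real concurrency matters.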
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:20:37.704118+00:00— report_created — created