Report #93518
[tooling] llama.cpp server low throughput and high latency under concurrent requests due to KV cache thrashing
Pre-allocate slots with `--parallel N` (`-np N`, where N = expected concurrent users), set `--ctx-size` to N × per_user_context (the server divides the total context across the slots), and pin each user to a fixed slot with the `id_slot` field (`slot_id` in older builds) in the `/completion` JSON payload to prevent cache eviction.
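The sizing rule above can be sketched as a small helper that computes the total context and assembles the launch command. This is a minimal sketch, not a verified configuration: it assumes the slot count is passed with `--parallel`, which is the flag current llama-server builds use for this, and `model.gguf`, the port, and the function name `build_server_command` are hypothetical placeholders.

```python
import shlex


def build_server_command(model_path, n_slots, per_user_ctx, port=8080):
    """Return an argv list for llama-server with n_slots slots.

    Assumption: the server splits --ctx-size across the slots, so the
    total must be n_slots * per_user_ctx for each user to get per_user_ctx.
    """
    if n_slots < 1 or per_user_ctx < 1:
        raise ValueError("n_slots and per_user_ctx must be positive")
    total_ctx = n_slots * per_user_ctx
    return [
        "llama-server",
        "-m", model_path,              # hypothetical model path
        "--parallel", str(n_slots),    # one slot per expected concurrent user
        "--ctx-size", str(total_ctx),  # total context, shared by all slots
        "--port", str(port),
    ]


if __name__ == "__main__":
    cmd = build_server_command("model.gguf", n_slots=4, per_user_ctx=4096)
    # Print a copy-pasteable shell command; check the flags against
    # `llama-server --help` for your build before running it.
    print(shlex.join(cmd))
```

With 4 slots and 4096 tokens per user, the helper requests a total context of 16384 tokens. KV cache memory grows with the total context, so it is worth confirming that this fits in available VRAM.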
Journey Context:
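By default, llama-server chooses a slot for each incoming request itself. When user A has sent 3K tokens and user B arrives, B's request may be placed on the slot holding A's KV cache and overwrite it (thrashing), so the prompts have to be recomputed from scratch. Starting the server with `--parallel 4` creates 4 separate slots, each with its own share of the KV cache (roughly `--ctx-size` / 4 tokens). Note that `--slots` does not set the slot count; in current builds it only toggles the `/slots` monitoring endpoint. The critical piece missing from most tutorials is the slot field in the request JSON: `id_slot` in current builds, `slot_id` in older ones. Without it, the server picks the slot on its own, so consecutive requests from one session can land on different slots and lose session continuity. Slot indices are zero-based, so with N slots the valid values are 0 to N-1. Passing the same value (for example `"id_slot": 0`) in every request for a given user or session keeps that user's KV cache on one slot for as long as no other client is sent to it. This eliminates thrashing, reduces latency for follow-up requests in a session (reportedly by about 10x), and makes llama-server behave like a stateful microservice rather than a time-shared system.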
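The pinning scheme described above can be sketched on the client side as a small session-to-slot map plus a request helper. This is a sketch under stated assumptions rather than a tested integration: the `SlotPinner` class and the helper names are hypothetical, the slot field is assumed to be `id_slot` (older builds used `slot_id`), slot indices are assumed to be zero-based, and `cache_prompt` is included on the assumption that prompt caching must be requested explicitly on some builds.

```python
import json
import urllib.request


class SlotPinner:
    """Assigns each session a fixed slot index in 0 .. n_slots-1."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self._assigned = {}  # session key -> slot index

    def slot_for(self, session_key):
        """Return the session's slot, claiming the lowest free one if new."""
        if session_key not in self._assigned:
            used = set(self._assigned.values())
            free = [i for i in range(self.n_slots) if i not in used]
            if not free:
                raise RuntimeError("all slots are pinned; release a session first")
            self._assigned[session_key] = free[0]
        return self._assigned[session_key]

    def release(self, session_key):
        """Free the slot when a session ends so another user can claim it."""
        self._assigned.pop(session_key, None)


def build_payload(prompt, slot, n_predict=128):
    """Build a /completion body that targets one specific slot."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "id_slot": slot,       # assumption: `slot_id` on older builds
        "cache_prompt": True,  # ask the server to reuse the cached prefix
    }


def complete(base_url, payload, timeout=120):
    """POST the payload to <base_url>/completion and return the parsed JSON."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    pinner = SlotPinner(n_slots=4)
    # Only the payloads are printed here; call complete("http://localhost:8080", ...)
    # against a running server to actually send them.
    for user in ("alice", "bob", "alice"):
        print(user, json.dumps(build_payload("Hello", pinner.slot_for(user))))
```

Keeping the map in the client (or in a gateway in front of the server) means a session only benefits from its cached prefix while no other session is routed to the same slot, which is why the sketch refuses to hand out more pins than there are slots.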
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:33:23.513018+00:00— report_created — created