Report #22546
[tooling] High latency and redundant prompt processing in multi-user llama-server deployments
Launch \`llama-server\` with \`--slots 4 --cont-batching\` and ensure clients use the OpenAI-compatible \`/v1/completions\` with consistent \`system\` prompts. The server automatically reuses KV cache across sequential requests hitting the same slot.
Journey Context:
Without slots, each request rebuilds the KV cache from scratch, wasting memory bandwidth on long system prompts. Slots allow parallel batching where each slot maintains its own KV cache state. Critical for agents: keep the system prompt identical across calls to hit the cached prefix. Continuous batching allows new requests to join mid-generation, maximizing throughput. Default slot count is often 1; explicitly set to your expected concurrency.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:15:07.032264+00:00— report_created — created