Report #61277
[tooling] llama.cpp server creating new context for every request causing 100x slowdown
Launch the server once with --parallel N and --cont-batching, and send cache_prompt: true in each request so a slot's KV cache is reused across requests
Journey Context:
Users often launch llama-server once per request or ignore slot management, which forces a full re-evaluation of the prompt on every call. The server uses a 'slot' system in which each slot maintains its own KV cache state. Setting --parallel (the number of slots) and enabling --cont-batching lets a single running server process multiple sequences in parallel while sharing the loaded weights. Crucially, setting cache_prompt: true in the request JSON keeps the KV cache for that prompt in the slot, so a later request with the same prompt prefix only has to evaluate the new suffix instead of recomputing the whole prompt. This turns the server from a slow, effectively stateless endpoint into a high-throughput inference engine.
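A minimal sketch of the setup described above, in Python. It assumes the server listens on 127.0.0.1:8080 and exposes a /completion endpoint that accepts prompt, n_predict, and cache_prompt and returns a content field. These names follow the llama.cpp server README as commonly documented, but they can differ between versions, so check them against your build. The helper names (server_command, completion_payload, complete) and the model path are hypothetical and used only for illustration.

```python
import json
import urllib.request

# Assumed address of one long-lived llama-server process; adjust as needed.
SERVER_URL = "http://127.0.0.1:8080"


def server_command(model_path: str, slots: int) -> list[str]:
    """Build the argv for starting the server once, with `slots` parallel slots."""
    return [
        "llama-server",
        "-m", model_path,
        "--parallel", str(slots),
        "--cont-batching",
    ]


def completion_payload(prompt: str, n_predict: int = 128) -> dict:
    """Request body that asks the server to keep the prompt's KV cache in the slot."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "cache_prompt": True,
    }


def complete(prompt: str, n_predict: int = 128, timeout: float = 120.0) -> str:
    """POST one completion request to the already-running server."""
    body = json.dumps(completion_payload(prompt, n_predict)).encode("utf-8")
    request = urllib.request.Request(
        SERVER_URL + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["content"]


if __name__ == "__main__":
    # Start the server separately (once), for example with:
    print(" ".join(server_command("model.gguf", 4)))

    # Both requests share a long prefix, so the second one should only need
    # to evaluate the differing tail if the slot's cache is reused.
    shared_prefix = "You are a helpful assistant.\n\n"
    print(complete(shared_prefix + "Question: What is a KV cache?\nAnswer:"))
    print(complete(shared_prefix + "Question: What is a slot?\nAnswer:"))
```

The key design point is that the process is started once and every request goes to it over HTTP. Spawning a new process per call would discard the slot state that cache_prompt relies on.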
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:20:10.481678+00:00— report_created — created