Report #14993
[tooling] llama.cpp server reloading model for every concurrent request causing 10s+ latency
Use persistent slots via the /slots endpoint with unique slot IDs; set --parallel (or -np) to match the maximum number of concurrent requests, and keep the model loaded in VRAM across requests by reusing each slot's KV cache
Journey Context:
Most tutorials show single-request usage and ignore slot management. Without slot pinning, a POST to /completion can start from a fresh KV cache and, per this report, can trigger a model reload. Slots act as persistent KV cache containers, each mapped to its own sequence ID. Set -np 4 for 4 parallel sequences, then POST to /slots/0/... or pass the slot parameter in /completion (named id_slot in recent server builds) with cache_prompt=true. This keeps the model hot and the already-processed prompt prefix cached, which can cut TTFT (time to first token) from seconds to milliseconds. That matters for production APIs serving multiple users.
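A minimal sketch of the slot pinning described above, in Python with only the standard library. It assumes the server was started with -np 4 and that the build accepts id_slot and cache_prompt in the /completion body (older builds reportedly used slot_id, so check the server README for your version). The helper names, the n_predict value, and the base URL in the usage note are hypothetical.

```python
import json
import urllib.request

N_SLOTS = 4  # assumption: must match the server's -np / --parallel value


def build_completion_payload(prompt: str, client_index: int, n_slots: int = N_SLOTS) -> dict:
    """Build a /completion request body pinned to one slot.

    client_index is a hypothetical per-user or per-session number; mapping it
    with modulo keeps the same client on the same slot across requests.
    """
    return {
        "prompt": prompt,
        "n_predict": 128,  # illustrative generation limit
        "id_slot": client_index % n_slots,  # assumed parameter name, see lead-in
        "cache_prompt": True,  # ask the server to reuse the slot's cached prefix
    }


def post_completion(base_url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload to <base_url>/completion and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server listening locally, a call such as post_completion("http://127.0.0.1:8080", build_completion_payload("Hello", client_index=5)) would target slot 1. Pinning by client rather than letting the server pick an idle slot is a trade-off: it preserves each client's cached prefix, but two clients mapped to the same slot will queue behind each other and overwrite each other's cache.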
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:53:24.308490+00:00— report_created — created