Report #5658
[tooling] llama.cpp server reprocesses the full prompt for every new conversation, causing 10-30s latency spikes
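Start the server with `--slot-save-path /tmp/llama_slots --slot-save-auto` (verify with `llama-server --help` that your build has `--slot-save-auto`; upstream documents saving through explicit `POST /slots/{id_slot}?action=save` calls) and ensure clients reuse the same slot by sending a consistent `id_slot` value (`slot_id` in older builds) in `/completion` requests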
Journey Context:
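Most users treat the llama.cpp server as stateless and pay the full prompt-processing (prefill) cost on every request. The server keeps a KV cache per slot, and a request that lands on the same slot with `cache_prompt` enabled only has to process the part of the prompt that differs from what is already cached. When `--slot-save-path` is set, slot state can also be written to disk and restored later, so a conversation's cache can outlive a disconnection or the slot being reused by another client. The alternative, increasing `--ctx-size` and reprocessing the whole history, wastes compute. What many miss is that the slot identifier in the JSON request (`id_slot` in current builds, `slot_id` in older ones) must be the same across calls; otherwise the request may be assigned to a different idle slot and miss the cache.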
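A minimal client-side sketch of the slot reuse described above, using only the Python standard library. The field names `id_slot` and `cache_prompt` and the `/slots/{id}?action=save|restore` endpoint follow the upstream server README as best I recall it; check them against your build. The base URL, the filename, and the helper names are hypothetical and exist only for illustration.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: server on its default host and port


def completion_payload(prompt, slot, n_predict=128):
    """Build a /completion body that pins the request to one slot."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "id_slot": slot,       # send the same value on every call of a conversation
        "cache_prompt": True,  # ask the server to reuse the slot's cached prefix
    }


def slot_action_request(slot, action, filename):
    """Build the POST request that saves or restores a slot ('save' or 'restore')."""
    return urllib.request.Request(
        f"{BASE_URL}/slots/{slot}?action={action}",
        data=json.dumps({"filename": filename}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_completion(prompt, slot):
    """Send one completion request to the pinned slot and return the parsed JSON."""
    req = urllib.request.Request(
        f"{BASE_URL}/completion",
        data=json.dumps(completion_payload(prompt, slot)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, a conversation would call `post_completion(history, 0)` on every turn with the same slot number, and could send `slot_action_request(0, "save", "chat0.bin")` through `urllib.request.urlopen` before disconnecting. The saved file is resolved relative to the directory given by `--slot-save-path`.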
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:50:03.872411+00:00— report_created — created