Report #9346
[tooling] llama.cpp server loses all conversation state on restart, requiring clients to resend full prompt history, increasing TTFT (time to first token) and cost
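Start the server with --slot-save-path and use the slot save/restore actions (POST /slots/{id_slot}?action=save and ?action=restore) to persist each slot's KV cache to disk and reload it after a restart, so the cached part of the conversation does not have to be recomputed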
Journey Context:
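In production deployments of llama.cpp server, restarting the process (for updates, crashes, or scaling) wipes the in-memory KV cache for all active slots (conversations). On the next request the server has to reprocess each conversation's entire history, which is slow (long prompt processing, so high TTFT) and expensive (recomputing KV for tokens already seen). Starting the server with --slot-save-path PATH sets the directory for slot cache files and enables the slot save/restore actions. Saving is not periodic or automatic: a client or supervisor has to call POST /slots/{id_slot}?action=save with a JSON body containing a "filename" field before shutdown or at checkpoints, then call ?action=restore with the same filename once the server is back up. There does not appear to be a --slot-load-path flag in upstream llama.cpp; restoring goes through the same endpoint and reads from the same directory. Once a slot is restored, a request whose prompt begins with the cached tokens reuses them instead of recomputing them, so the conversation resumes with little prompt processing. This matters for stateful applications (chatbots, agents) using the OpenAI-compatible API. Note: Put the save path on fast storage (NVMe), since cache files for long contexts are large and saving them can block.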
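A minimal sketch of saving and restoring one slot over HTTP, assuming the server was started with --slot-save-path and exposes POST /slots/{id_slot}?action=save and ?action=restore as described in the llama.cpp server README. The base URL, slot id, file name, and the helper names build_slot_request and slot_action are illustrative assumptions, not part of llama.cpp.

```python
import json
import urllib.request

# Assumed address of a llama.cpp server started with --slot-save-path.
BASE_URL = "http://127.0.0.1:8080"


def build_slot_request(base_url, slot_id, action, filename):
    """Build the POST request for /slots/{slot_id}?action=save|restore."""
    # Only the two actions discussed in this report are handled here.
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported slot action: {action!r}")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    # The server resolves the file name inside the --slot-save-path directory.
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def slot_action(base_url, slot_id, action, filename, timeout=120.0):
    """Send the request and return the server's JSON reply as a dict."""
    request = build_slot_request(base_url, slot_id, action, filename)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example calls (they need a running server, so they are left commented out):
#   slot_action(BASE_URL, 0, "save", "chat-0.bin")     # before shutdown
#   slot_action(BASE_URL, 0, "restore", "chat-0.bin")  # after restart
```

A supervisor script could call the save action for every active slot before stopping the process and the restore action right after the new process reports healthy. Which slot a conversation occupies is deployment-specific, so the mapping from conversation to slot id and file name has to be tracked by the caller.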
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:51:58.012375+00:00— report_created — created