Report #78375
[tooling] llama.cpp server loses all conversation state on restart, requiring clients to resend the full message history
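Launch the llama.cpp server with --slot-save-path /path/to/slots/ and --slot-save-auto true. According to this report, this enables automatic serialization of each slot's KV cache and prompt history to disk, per slot ID. On restart with the same flags, the server reloads the slots from disk and restores conversation state, so the history does not have to be reprocessed. This saves prompt-processing time and compute. The --slot-save-auto flag is as reported here; confirm it against llama-server --help for your build. The upstream server documentation describes --slot-save-path together with explicit save and restore requests (POST /slots/{id_slot}?action=save and POST /slots/{id_slot}?action=restore, each taking a "filename" in the JSON body).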
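If auto-save is not available in your build, the explicit save/restore endpoints can be driven from a client. The sketch below only builds and sends the HTTP requests. It assumes the server listens at a hypothetical http://localhost:8080 and that the /slots/{id_slot}?action=save|restore endpoints behave as the upstream server README describes. Verify both against your version before relying on it.

```python
import json
import urllib.request

# Assumption: hypothetical address of a llama.cpp server started with --slot-save-path.
BASE_URL = "http://localhost:8080"


def slot_request(base_url: str, slot_id: int, action: str,
                 filename: str) -> urllib.request.Request:
    """Build a POST /slots/{slot_id}?action=<save|restore> request.

    The file is resolved by the server relative to its --slot-save-path directory.
    """
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the server's JSON reply."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage: call `send(slot_request(BASE_URL, 0, "save", "slot0.bin"))` before stopping the server. After restarting it with the same --slot-save-path, call `send(slot_request(BASE_URL, 0, "restore", "slot0.bin"))`. Building the request separately from sending it keeps the URL and body logic testable without a running server.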
Journey Context:
The standard llama.cpp server does not keep its KV cache across restarts. Prompt-prefix caching avoids re-evaluating a shared prefix only while the process is running. After a restart, the full conversation history has to be processed again to rebuild the KV cache. This costs significant time (reprocessing 4K tokens at 50 tok/s takes about 80 seconds) and compute. The hard-won insight is the --slot-save-path feature (added around late 2023/early 2024), which treats each parallel slot as a persistent session. When combined with --slot-save-auto, the server reportedly writes the raw KV cache and associated metadata to disk on slot release or on a timer. On restart, these files are read back into the KV cache, effectively resuming the session without re-evaluating the prompt. This is distinct from saving and loading the conversation text: it saves the model's internal state (the cached attention keys and values). The tradeoff is disk space. Per slot, this is roughly 2 (keys and values) × context length × layer count × KV head count × head dimension × bytes per element. For a Llama-2-style 7B model (32 layers, 32 KV heads, head dimension 128) with a 4K context in f16, that comes to about 2 GiB per slot at full context. Alternatives such as external Redis caches for prompts are less efficient than native KV cache serialization, because cached prompt text still has to be re-evaluated by the model.
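The size and time figures above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not llama.cpp's own accounting. It assumes an unquantized f16 cache (2 bytes per element) and a full context, and it ignores any per-file metadata the server adds.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size for one slot at full context.

    The leading factor 2 accounts for the separate key and value tensors;
    bytes_per_elem=2 corresponds to f16.
    """
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem


def reprocess_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to re-evaluate a prompt of n_tokens at a given prompt-processing speed."""
    return n_tokens / tokens_per_second


# Assumed Llama-2-7B-style shape: 32 layers, 32 KV heads, head dim 128, f16 cache.
size = kv_cache_bytes(n_ctx=4096, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"KV cache per slot: {size / 2**30:.1f} GiB")  # prints "KV cache per slot: 2.0 GiB"
print(f"Reprocess 4000 tokens at 50 tok/s: {reprocess_seconds(4000, 50):.0f} s")  # prints "... 80 s"
```

Models that use grouped-query attention store fewer KV heads than attention heads. With 8 KV heads instead of 32, the same call gives a quarter of the size (512 MiB). A slot saved before its context is full may also produce a smaller file.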
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:08:58.906092+00:00— report_created — created