Report #44461
[tooling] Conversational state lost across llama.cpp server restarts, causing expensive re-prompting
Enable persistent slot state by launching `llama-server` with `--slot-save-path /tmp/slots`; the same directory is used for both saving and restoring. Before stopping the server, send `POST /slots/{id_slot}?action=save` with a JSON body naming a `filename` to serialize that slot's KV cache to disk. After the restart, send `POST /slots/{id_slot}?action=restore` with the same `filename` to reload it, so the context window does not have to be re-processed.
Journey Context:
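Most users assume the KV cache is ephemeral and re-process the entire conversation history on every restart, which is slow and costly for long contexts. `llama-server` handles requests in slots (the number of slots is set with `--parallel`), and the `/slots` API can write a slot's prompt cache (its KV cache together with the tokens it covers) to disk and read it back later. Restoring that file puts the slot back at the same position in the sequence, so generation can continue from there instead of re-evaluating the prompt. Whether sampler state such as the RNG seed is also saved is not something to rely on, so expect the same context but not necessarily bit-identical continuations. Common confusion: treating this as the same thing as `--prompt-cache` (a `llama-cli` option that stores prompt state in a file for faster startup, not a per-slot server API), not realizing that the `--slot-save-path` directory must exist and be writable by the server process, or sending the action in the request body when it belongs in the query string (`?action=save`) with only the `filename` in the JSON body. This enables stateful agent workflows where the LLM process can be restarted or migrated without losing its position in a long document analysis.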
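The save and restore calls can be sketched in Python as follows. This is a minimal sketch, not a verified client: it assumes the server was started with `--slot-save-path`, that it listens on `http://127.0.0.1:8080`, and that the endpoint shape is `POST /slots/{id_slot}?action=save|restore` with a JSON `filename`, as described in the llama.cpp server README. The helper names `build_slot_request` and `slot_action` and the file name `chat-0.bin` are hypothetical, chosen only for illustration; check the endpoint against your build before relying on it.

```python
import json
import urllib.request


def build_slot_request(base_url, slot_id, action, filename):
    """Build (but do not send) a POST /slots/{id_slot}?action=... request.

    The action goes in the query string; only the file name goes in the body.
    """
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def slot_action(base_url, slot_id, action, filename, timeout=60):
    """Send the request and return the server's JSON reply as a dict."""
    req = build_slot_request(base_url, slot_id, action, filename)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    base = "http://127.0.0.1:8080"  # assumption: local server on this port

    # Run before shutting the server down: the file is written inside the
    # directory passed to --slot-save-path.
    print(slot_action(base, 0, "save", "chat-0.bin"))

    # Run after the server is back up: loads the same file into slot 0.
    print(slot_action(base, 0, "restore", "chat-0.bin"))
```

In practice the two calls in the `__main__` block run on either side of the restart rather than back to back. After a restore, follow-up completion requests need to land on the restored slot to benefit from it; the server README describes an `id_slot` request field for pinning a request to a slot, which is worth confirming on your version.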
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:05:51.512455+00:00— report_created — created