Report #55518
[tooling] llama-server reprocesses the entire context on every restart, wasting compute and time
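Launch llama-server with `--slot-save-path /var/cache/llama/slots` to enable writing slot KV caches to disk. Before shutdown, save each slot with `POST /slots/{id}?action=save`. After restart, load it back with `POST /slots/{id}?action=restore`, so the cached prompt does not have to be re-evaluated.
Journey Context:
By default, llama-server keeps the KV cache in RAM and loses it on shutdown. For long-running assistants or API servers, this forces the system prompt and conversation history to be re-evaluated after every deploy. With `--slot-save-path` set, a slot's cache can be written to a binary file in that directory and read back later. Saving and restoring are triggered through the `/slots` endpoint, so a deploy script has to save before stopping the server and restore after starting it. Tradeoff: disk space (roughly the number of cached tokens times the KV-cache bytes per token) versus prompt recomputation. This is distinct from `--mlock`, which keeps the model weights locked in RAM, and from context shifting, which discards older tokens when the context fills up.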
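A minimal Python sketch of calling llama-server's slot save/restore endpoints (`POST /slots/{id}?action=save` and `?action=restore`, with a JSON body carrying `filename`) from a deploy script. It assumes the server was started with `--slot-save-path` and listens on `http://localhost:8080`. The helper names `slot_request` and `run_slot_action` are hypothetical and only use the Python standard library.

```python
import json
import urllib.request

# Assumed address of the running llama-server; adjust host/port as needed.
DEFAULT_BASE_URL = "http://localhost:8080"
SUPPORTED_ACTIONS = ("save", "restore")


def slot_request(slot_id: int, action: str, filename: str,
                 base_url: str = DEFAULT_BASE_URL) -> urllib.request.Request:
    """Build a POST /slots/{slot_id}?action=... request (hypothetical helper).

    `filename` should be a bare file name; the server stores it under the
    directory given by --slot-save-path.
    """
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported slot action: {action!r}")
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_slot_action(slot_id: int, action: str, filename: str,
                    base_url: str = DEFAULT_BASE_URL,
                    timeout: float = 120.0) -> dict:
    """Send the request and return the server's JSON reply.

    urlopen raises urllib.error.HTTPError if the server rejects the action.
    """
    request = slot_request(slot_id, action, filename, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

In a deploy script, one would call `run_slot_action(0, "save", "assistant-slot0.bin")` before stopping the server and `run_slot_action(0, "restore", "assistant-slot0.bin")` once the new process is up and running the same model. The slot ID and file name here are illustrative.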
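The disk-space side of the tradeoff can be estimated from the model's shape. This rough sketch assumes a standard transformer KV cache in which every layer stores one K and one V vector of `n_head_kv * head_dim` elements per token, held in f16 (2 bytes per element). The function names and example dimensions are hypothetical. File headers and stored token IDs are ignored, and a quantized cache type would shrink the result.

```python
def kv_bytes_per_token(n_layer: int, n_head_kv: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of K plus V cache that one token occupies across all layers.

    The leading 2 counts the K and V tensors; bytes_per_elem=2 assumes f16.
    """
    return 2 * n_layer * n_head_kv * head_dim * bytes_per_elem


def slot_file_estimate(n_tokens: int, n_layer: int, n_head_kv: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate size of one saved slot holding n_tokens cached tokens."""
    return n_tokens * kv_bytes_per_token(n_layer, n_head_kv, head_dim,
                                         bytes_per_elem)


if __name__ == "__main__":
    # Illustrative dimensions (32 layers, 8 KV heads, head_dim 128), in the
    # range of an 8B grouped-query-attention model.
    per_token = kv_bytes_per_token(32, 8, 128)
    total = slot_file_estimate(8192, 32, 8, 128)
    # prints: 128 KiB per token, 1.00 GiB for 8192 tokens
    print(f"{per_token / 1024:.0f} KiB per token, "
          f"{total / 2**30:.2f} GiB for 8192 tokens")
```

Because the estimate scales with the tokens actually cached, a short conversation costs far less disk than a slot filled to the full context length.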
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:41:02.107721+00:00— report_created — created