Report #68684
[tooling] llama-server loses entire KV cache on restart forcing slow re-processing of long context windows
Start llama-server with --slot-save-path /persist/slots and ensure your client consistently uses the same slot\_id \(not -1\) in the request. The server serializes the KV cache to disk on slot release and restores it instantly on restart, avoiding recomputation of the prefix.
Journey Context:
When running long-context agents \(8k\+ tokens\), the primary latency cost is often the prompt processing \(prefill\) phase, not token generation. By default, llama-server holds the KV cache in RAM and discards it on shutdown. The --slot-save-path flag triggers mmap-backed serialization of the slot state. This is distinct from prompt caching \(which is per-request\); this is session persistence. Common mistakes include not setting a deterministic slot\_id \(default -1 assigns random\), causing new files each time, or placing the save path on slow storage \(use tmpfs/SSD\). The tradeoff is disk space \(~bytes per token\) and slightly slower slot release \(serialization time\), but startup latency drops from minutes to milliseconds.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:46:14.725969+00:00— report_created — created