Report #79227
[tooling] llama-server loses all conversation context on restart, forcing clients to resend expensive long prompts
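Start llama-server with --slot-save-path <dir> so that KV cache slots can be written to disk, and have each client keep using the same slot (the id_slot request field in current llama.cpp builds; older builds called it slot_id). As far as the llama.cpp server documentation describes it, persistence is explicit rather than automatic: save a slot with POST /slots/{id_slot}?action=save and, after a restart, reload it with POST /slots/{id_slot}?action=restore, passing the same filename in the JSON body. The file name is whatever the caller chooses; no fixed extension such as .llama_slot_cache is required. Once restored, the slot holds the saved conversation's KV state, so the next request that shares that prompt prefix does not have to reprocess it.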
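The save-and-restore round trip can be sketched as a small Python client. This is a minimal sketch, not the project's own tooling: it assumes the server was started with --slot-save-path, that it listens on 127.0.0.1:8080, and that the /slots/{id_slot}?action=save|restore endpoints behave as the llama.cpp server README describes. The helper names and the one-file-per-session naming scheme are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumed host and port of llama-server


def slot_filename(session_id: str) -> str:
    """Hypothetical naming scheme: one save file per conversation."""
    return f"{session_id}.bin"


def slot_action_request(base_url: str, id_slot: int, action: str,
                        filename: str) -> urllib.request.Request:
    """Build the POST request that saves or restores a single slot."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    url = f"{base_url}/slots/{id_slot}?action={action}"
    # The server resolves this file name relative to --slot-save-path.
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_slot_action(id_slot: int, action: str, session_id: str,
                    base_url: str = BASE_URL) -> dict:
    """Send the request and return the server's JSON reply (needs a live server)."""
    req = slot_action_request(base_url, id_slot, action, slot_filename(session_id))
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A deployment would call run_slot_action(0, "save", "chat-42") before shutting the server down and run_slot_action(0, "restore", "chat-42") once it is back up, then keep sending completion requests with "id_slot": 0 so the restored cache is the one that gets reused. Check the endpoint names against the README of the build you run before relying on this.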
Journey Context:
Without this, every deployment restart (common in Docker and Kubernetes) drops active conversations and forces a cold start that reprocesses the entire prompt history, which is crippling for 128k-token context windows. Many users assume the KV cache is inherently volatile, or rely only on prompt caching (read-only), and miss that llama-server can serialize a slot's full mutable state to disk. The tradeoffs are disk I/O overhead (the saved state grows with the number of cached tokens times the layer count) and the need for stable slot IDs. In a container, the save path must also sit on a persistent volume, otherwise the saved files disappear along with the container. This differs from vLLM's automatic prefix caching: it requires explicit client cooperation, but in exchange it restores an exact session.
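To put a rough number on the disk I/O tradeoff, the KV payload can be estimated from the model's shape. This is a back-of-the-envelope sketch: the dimensions below (32 layers, 8 KV heads, head size 128, 16-bit cache entries) are illustrative assumptions, not values taken from this report, and the real save file also carries tokens and metadata that the estimate ignores.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate the KV cache payload for n_tokens cached tokens.

    The leading factor 2 accounts for one key tensor and one value tensor
    per layer; bytes_per_elem=2 assumes an unquantized 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


if __name__ == "__main__":
    # Assumed example shape; substitute the values for the model in use.
    size = kv_cache_bytes(n_tokens=128 * 1024, n_layers=32,
                          n_kv_heads=8, head_dim=128)
    print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

Under those assumptions a full 128k-token slot is about 16 GiB, so save and restore time is dominated by sequential disk throughput. A quantized KV cache or a shorter conversation shrinks the figure proportionally.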
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:34:39.702128+00:00— report_created — created