Report #17810
[tooling] llama.cpp server loses all conversation state on restart requiring full model reload
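Start the server with `--slot-save-path /path/to/states` so slot state (KV cache and tokens) can be written to disk, then save each slot with `POST /slots/{id_slot}?action=save` and reload it after a restart with `POST /slots/{id_slot}?action=restore`. The `--slot-save-auto` flag is unverified: check `llama-server --help` for it and fall back to the explicit save/restore calls if it is absent. Restoring a slot skips reprocessing the prompt; the model weights are still loaded at startup.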
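A minimal sketch of the explicit save/restore calls, assuming the server was started with `--slot-save-path` and listens on a local address. The helper names `build_slot_request` and `send`, the address `http://127.0.0.1:8080`, and the file name `slot0.bin` are illustrative assumptions, not part of llama.cpp; the endpoint shape should be checked against the server README for the installed version.

```python
import json
import urllib.request


def build_slot_request(base_url, slot_id, action, filename):
    """Build a POST request for the slot save/restore endpoint.

    Hypothetical helper: the URL shape /slots/{id}?action=... and the
    {"filename": ...} body are assumed from the server documentation.
    """
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Before a planned restart, something like `send(build_slot_request("http://127.0.0.1:8080", 0, "save", "slot0.bin"))` would write slot 0 under the save path; the same call with `"restore"` after startup would load it back. The file name is relative to the directory given to `--slot-save-path`.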
Journey Context:
Production deployments restart for updates or after crashes; reloading a 70B model from disk takes minutes and can evict other GPU workloads, and without saved state every conversation's prompt must then be reprocessed from scratch. Most users don't know llama.cpp can serialize the full slot state (prompt, generated tokens, KV cache) to files. The tradeoff is disk space: roughly the number of cached tokens × the per-token KV cache size summed over all layers. Alternatives like a Redis external cache don't exist in llama.cpp; this is the only native solution.
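The disk-space tradeoff can be estimated with a short calculation. This is a rough sketch, not the server's actual file format: it counts only the K and V tensors and ignores the stored tokens and file metadata. The model shape used below (80 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption matching a Llama-2-70B-style architecture; substitute the values for the model actually deployed.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Approximate KV cache size in bytes for n_tokens cached tokens.

    The leading factor 2 accounts for one K and one V tensor per layer.
    bytes_per_element=2 assumes an f16 cache; quantized caches are smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_tokens


# Assumed 70B-class shape with grouped-query attention (illustrative values).
size = kv_cache_bytes(n_tokens=4096, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB per 4096-token slot")  # 1.25 GiB per 4096-token slot
```

Under these assumptions each cached token costs 320 KiB, so a handful of long-context slots can reach tens of gigabytes on disk.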
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:24:33.034604+00:00— report_created — created