Report #70915
[tooling] llama-server loses context between restarts; high token cost for long conversations
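Start llama-server with --slot-save-path to enable persisting a slot's KV cache to disk, then save the slot before shutdown and restore it after restart so the prompt tokens are not recomputed. Note: --slot-load-path does not appear to be a llama-server flag; in current builds, restore is done through the slot API (POST /slots/{id}?action=save and ?action=restore, with a filename relative to the save path). Check llama-server --help and the server README for your build.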
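A minimal sketch of the save/restore calls, assuming the server was started with --slot-save-path and exposes POST /slots/{id}?action=save|restore taking a JSON body with a "filename" field. The endpoint shape, the base URL, and the helper names build_slot_request and send are assumptions for illustration; verify them against the server README for your build.

```python
import json
import urllib.request

ALLOWED_ACTIONS = ("save", "restore")


def build_slot_request(base_url, slot_id, action, filename):
    """Build (but do not send) a POST request for a slot save/restore action.

    Assumed endpoint shape: POST {base_url}/slots/{slot_id}?action={action}
    with body {"filename": "..."}; the file lives under --slot-save-path.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action must be one of {ALLOWED_ACTIONS}, got {action!r}")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical usage against a local server (not executed here):
#   send(build_slot_request("http://localhost:8080", 0, "save", "chat-0.bin"))
#   ... restart llama-server with the same model and --slot-save-path ...
#   send(build_slot_request("http://localhost:8080", 0, "restore", "chat-0.bin"))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before pointing the script at a live server.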
Journey Context:
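Standard usage recomputes the entire prompt history after every server restart, an O(n) compute cost in the number of prompt tokens. Persistent slots save the KV cache state to a file on disk, so a restart only has to read that file back instead of re-evaluating the prompt. Tradeoffs: disk space (~2-4 GB per slot for 70B models, depending on how many tokens are cached) and incompatibility with model changes, since a saved slot can only be restored into the same model and architecture it was created with. Most users don't know this exists and burn tokens/CPU recomputing prompts.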
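As a rough check on the disk-space figure, the KV cache size can be estimated from the model shape. The sketch below assumes Llama 2 70B dimensions (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an unquantized 16-bit cache; the function name kv_cache_bytes is hypothetical, and the real slot file adds some metadata and may differ if the cache type is quantized.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate raw KV cache size: one K and one V vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


# Assumed Llama 2 70B shape: 80 layers, 8 KV heads, head_dim 128, f16 cache.
per_token = kv_cache_bytes(1, 80, 8, 128)      # 327680 bytes = 320 KiB per token
size_8k = kv_cache_bytes(8192, 80, 8, 128)     # 2684354560 bytes = 2.5 GiB
print(per_token, size_8k / 2**30)
```

Under these assumptions an 8192-token conversation lands at about 2.5 GiB, which is consistent with the ~2-4 GB range quoted above; shorter or longer contexts scale linearly.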
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:36:31.216375+00:00— report_created — created