Report #93703
[tooling] Re-loading a 70B model into VRAM takes minutes; conversation context is lost when the client disconnects; at the context length limit (e.g., 4096 tokens), context shifting drops the beginning of the document
Use llama.cpp's server mode (`llama-server`) started with the `--slot-save-path PATH` CLI flag, then call the HTTP endpoints `POST /slots/{id_slot}?action=save` and `POST /slots/{id_slot}?action=restore` (each with a JSON body naming a `filename` inside that directory) to persist a slot's KV cache, including its full token history and context state, to disk and load it back. The running server keeps the model resident in VRAM, so this allows: (1) resuming long conversations almost instantly, without reloading the model or re-processing the prompt, (2) implementing 'infinite' context by saving slots at intervals and restoring them as needed, (3) server-side session persistence across client reconnections.
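A minimal sketch of the save/restore calls, assuming a `llama-server` instance started with `--slot-save-path` and reachable at `http://127.0.0.1:8080`. The helper names (`build_slot_request`, `send`), the address, the slot number, and the file name `chat-42.bin` are illustrative assumptions, not part of llama.cpp. Check the endpoint shape against the server README of the build in use, since it has changed between versions.

```python
import json
import urllib.request


def build_slot_request(base_url, slot_id, action, filename):
    """Build (but do not send) a POST request that saves or restores a slot."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    # The server resolves `filename` relative to the --slot-save-path directory.
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON reply (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical values: local server, slot 0, one file per conversation.
save_req = build_slot_request("http://127.0.0.1:8080", 0, "save", "chat-42.bin")
restore_req = build_slot_request("http://127.0.0.1:8080", 0, "restore", "chat-42.bin")
print(save_req.get_method(), save_req.full_url)
print(restore_req.get_method(), restore_req.full_url)
```

Calling `send(save_req)` when a client disconnects and `send(restore_req)` when it reconnects would give the session persistence described above. Building the request separately from sending it keeps the URL and body easy to inspect without a live server.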
Journey Context:
Most users run llama.cpp statelessly (`main.exe`, since renamed `llama-cli`) or use the server without slot management. The slot mechanism is designed for multi-user concurrency, but its save/restore feature is underdocumented. The KV cache holds the processed state of all previous tokens, so saving it avoids recomputing attention over the entire history. This is distinct from context shifting, which discards old tokens. Cache size depends on the architecture and cache precision: for a Llama-3-70B-style model (80 layers, 8 KV heads, head dimension 128), a full 32k-token KV cache is about 10 GiB at f16, and proportionally smaller with a quantized cache or fewer cached tokens, so on NVMe (fast sequential writes) it still saves in seconds. Alternatives such as Redis for state are slower. With limited VRAM, swapping slots in and out is a practical way to approximate 'infinite context'.
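As a rough check on the cache size and save time discussed above, the arithmetic can be sketched as follows. The architecture numbers (80 layers, 8 KV heads, head dimension 128, 2-byte f16 elements) are assumptions matching a Llama-3-70B-style model, and the 3 GB/s NVMe write rate is an assumed figure rather than a measurement. The saved file may also carry extra metadata such as the token list.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Approximate KV cache size: one key and one value vector per layer per token."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_tokens


def write_seconds(n_bytes, bytes_per_second):
    """Idealized sequential write time at a sustained throughput."""
    return n_bytes / bytes_per_second


# Assumed Llama-3-70B-style shape with a full 32k-token context at f16.
size = kv_cache_bytes(n_tokens=32768, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"KV cache: {size / 2**30:.1f} GiB")

# Assumed sustained NVMe write rate of 3 GB/s.
print(f"Save time: {write_seconds(size, 3e9):.1f} s")
```

Under these assumptions the cache comes to about 10 GiB and a few seconds of write time. A figure near 2 GB would correspond to a shorter context or a quantized KV cache.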
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:52:00.793930+00:00— report_created — created