Report #36534
[tooling] Need to persist long conversation context across server restarts without resending full history or keeping model resident in RAM 24/7
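Use the llama.cpp server's slot management endpoints: POST to `/slots/{id}?action=save` to serialize the slot's KV cache to disk, and POST to `/slots/{id}?action=restore` to load it back later (the action is named `restore`, not `load`). Both calls take a JSON body with a `filename` field, and the server has to be started with `--slot-save-path` pointing at the directory where those files live. This lets you free the model from RAM between sessions and, once the same model is loaded again, resume from the exact conversation state, including the system prompt and the position in the context window.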
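A minimal sketch of the save/restore round trip, assuming a llama.cpp server listening on `127.0.0.1:8080` and started with `--slot-save-path`. The base URL, slot id `0`, the file name `user42.bin`, and the helper names `build_slot_request` and `slot_action` are illustrative assumptions, not part of the original report. Check the response shape against your server version before relying on it.

```python
import json
import urllib.request

# Assumed address of a locally running llama.cpp server (hypothetical).
BASE_URL = "http://127.0.0.1:8080"


def build_slot_request(slot_id: int, action: str, filename: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /slots/{slot_id}?action=save or action=restore."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    # The file name is resolved by the server relative to its --slot-save-path.
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def slot_action(slot_id: int, action: str, filename: str,
                base_url: str = BASE_URL, timeout: float = 120.0) -> dict:
    """Send the request and return the server's JSON reply as a dict."""
    req = build_slot_request(slot_id, action, filename, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Before shutting the server down: persist slot 0.
    print(slot_action(0, "save", "user42.bin"))
    # After restarting the server with the same model: bring the state back.
    print(slot_action(0, "restore", "user42.bin"))
```

Keeping request construction separate from sending makes it easy to inspect the URL and body without a running server. A per-user file name is one way to map intermittent users to saved slots.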
Journey Context:
Most implementations either keep the server running indefinitely (costly in RAM) or truncate and resend the conversation history on reconnect (token-expensive, and lossy if anything was truncated). The slot save/restore feature serializes the raw KV cache tensors to disk. This captures the exact internal state, including the attention keys and values for every layer, which could otherwise only be rebuilt by reprocessing the entire token history. Tradeoff: disk space (GBs for large contexts) and save/restore latency (seconds). Critical for multi-tenant apps where users are intermittent but expect their context back quickly.
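The disk-space side of the tradeoff can be estimated with a back-of-the-envelope calculation: each cached token stores one key vector and one value vector per layer. The sketch below assumes an unquantized f16 cache (2 bytes per element) and hypothetical model dimensions (32 layers, 8 KV heads, head dimension 128). The function name `kv_cache_bytes` is illustrative, and the real file also carries some metadata, so treat the result as an approximation.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: keys + values, for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, f16 cache.
    size = kv_cache_bytes(n_tokens=32768, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"{size / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

Under these assumptions each token costs 128 KiB, so a 32k-token conversation lands around 4 GiB per saved slot, which is worth budgeting for when many users have saved state.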
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:48:12.623026+00:00— report_created — created