Report #83023
[tooling] llama-server re-processes system prompt on every API restart, causing 30s+ cold starts for 70B models
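Launch llama-server with `--slot-save-path /path/to/kv_cache_dir` and make sure the directory exists. The flag enables saving and restoring KV-cache slots as files in that directory through the server's slot endpoints (`POST /slots/{id_slot}?action=save` and `POST /slots/{id_slot}?action=restore`, each taking a `filename` in the JSON body). As far as the upstream server documentation describes, slots are not written automatically on SIGTERM or after a `cache_prompt` timeout, so trigger the save before shutdown and the restore after restart. Once a slot is restored, the prompt is not re-evaluated, which reportedly reduces 70B model cold starts from 30s to <1s.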
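The save-before-shutdown and restore-after-restart flow can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: it assumes the `/slots/{id_slot}?action=save|restore` endpoints from upstream llama.cpp, and the helper names, port, model path, and filename are hypothetical.

```python
import json
import os
import urllib.request


def build_server_command(model_path, slot_dir, port=8080):
    """Create the slot directory if needed and return the llama-server argv."""
    os.makedirs(slot_dir, exist_ok=True)
    return [
        "llama-server",
        "-m", model_path,
        "--port", str(port),
        "--slot-save-path", slot_dir,
    ]


def slot_action_request(base_url, slot_id, action, filename):
    """Build (but do not send) a POST request that saves or restores one slot."""
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    url = f"{base_url}/slots/{slot_id}?action={action}"
    # The filename is resolved by the server relative to --slot-save-path.
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A deployment script would call `send(slot_action_request(base, 0, "save", "system_prompt.bin"))` just before stopping the server, and the same call with `"restore"` once the restarted server is healthy. Check the endpoint names against the server version you run before relying on this.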
Journey Context:
Production deployments of local LLMs often restart containers or processes for updates, and every restart forces expensive re-processing of large system prompts. By default llama-server keeps the KV cache in RAM only, so all context is lost on restart. Users often try to work around this by sending the system prompt as 'input' on every request, which defeats the purpose of KV-cache reuse. The `--slot-save-path` flag enables serialization of slot state to files on disk. Tradeoffs: disk I/O when saving and restoring (negligible next to model loading), and disk space (~MB per slot per 1k tokens, depending on model size and cache precision). This is distinct from model persistence: it caches the computed key/value tensors for the prompt, not the weights.
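The disk-space tradeoff can be estimated with simple arithmetic before committing to a slot directory. The sketch below assumes the usual KV-cache layout (one key and one value vector per layer per token). The shape values are assumptions for a 70B-class model with grouped-query attention and an f16 cache, not numbers read from any specific model file.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Estimate KV-cache size: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return n_tokens * per_token


# Assumed shape: 80 layers, 8 KV heads, head dimension 128, 2-byte (f16) elements.
size = kv_cache_bytes(n_tokens=1000, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{size / 2**20:.1f} MiB per 1000 tokens")
```

With these assumed values the estimate comes to 312.5 MiB per 1,000 tokens, so it is worth running the calculation with your own model's dimensions and cache type when sizing the disk.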
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:56:35.878841+00:00— report_created — created