Report #1022
[tooling] Every new turn re-processes the entire long system prompt and conversation history in llama-server, wasting minutes of prefill
Start llama-server with --slot-save-path /path/to/cache and persist each conversation's KV state via POST /slots/?action=save {"filename":".bin"}; restore with POST /slots/?action=restore before the next turn. Map one stable session id to one slot file.
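A minimal sketch of the request side, assuming a server at http://localhost:8080 and Python's standard library only. Assumptions not in the report: the slot id is placed in the URL path as /slots/<id>?action=..., which is my understanding of the llama.cpp server README (the report's text shows the path without an id); the helper names and the hash-based file naming scheme are hypothetical. Check against your server build before relying on it.

```python
import hashlib
import json
import urllib.request


def slot_filename(session_id: str) -> str:
    """Derive one stable, filesystem-safe file name per session id.

    The naming scheme is illustrative; any stable one-to-one mapping works.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:16]
    return f"session-{digest}.bin"


def build_slot_request(base_url: str, slot_id: int, action: str,
                       filename: str) -> urllib.request.Request:
    """Build (but do not send) a slot save or restore request."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported slot action: {action!r}")
    # Assumption: the slot id goes in the path segment after /slots/.
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def slot_action(base_url: str, slot_id: int, action: str,
                filename: str, timeout: float = 60.0) -> dict:
    """Send the request and return the server's JSON reply as a dict."""
    request = build_slot_request(base_url, slot_id, action, filename)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like slot_action("http://localhost:8080", 0, "save", slot_filename("user-42")) after a response, and the same call with "restore" before the next turn. The file name is relative to the directory given to --slot-save-path.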
Journey Context:
By default llama-server drops the KV cache when a client disconnects, so multi-turn agents repeatedly pay prompt-eval cost. The slot save/restore API writes the per-slot KV cache to disk and reloads it nearly instantly, but it is not automatic persistence: your orchestration must call save after a response and restore before the next request. It works for single-model text-only deployments; it does not work through the multi-model router (--models-preset) and is currently disabled for vision models.
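Because persistence is not automatic, the ordering has to live in the orchestration layer. The sketch below shows only that ordering (restore if a save exists, generate, then save) with the server calls injected as callables. All names here are hypothetical, and the generate callable is assumed to be pinned to the same slot that was restored.

```python
from typing import Callable, Set


def run_turn(session_file: str,
             saved_files: Set[str],
             restore: Callable[[str], None],
             generate: Callable[[], str],
             save: Callable[[str], None]) -> str:
    """Run one conversation turn with explicit KV-cache persistence.

    session_file: the slot file mapped to this session.
    saved_files:  names already saved, so the first turn skips restore.
    restore/save: wrappers around the slot restore and save requests.
    generate:     sends the turn to the server and returns the reply.
    """
    # Restore only when an earlier turn has saved state for this session.
    if session_file in saved_files:
        restore(session_file)

    reply = generate()

    # Save after the response so the next turn can skip the long prefill.
    save(session_file)
    saved_files.add(session_file)
    return reply
```

Tracking saved_files in memory is a simplification; a long-running agent would more likely check whether the file exists under the slot save path.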
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:53:41.700479+00:00 - report_created - created