Report #48857
[tooling] Reprocessing system prompt on every API call wastes tokens and latency in llama.cpp server
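Use the /slots endpoint's save/restore actions, or pin requests to a fixed slot ID, to keep the KV cache available between requests. Set the number of slots with --parallel (-np); --slots only exposes the /slots monitoring endpoint and does not take a count. Save/restore also requires starting the server with --slot-save-path. Reference the slot in each completion request (id_slot in recent builds, slot_id in older ones).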
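A minimal sketch of the save/restore call, assuming the server was started with --slot-save-path and accepts POST /slots/{id}?action=save|restore|erase with a JSON body containing "filename". The helper name slot_action_request, the base URL, and the file name are hypothetical; check the exact route and body against your llama-server build.

```python
import json
import urllib.request


def slot_action_request(base_url, slot_id, action, filename=None):
    """Build (but do not send) a POST request for a slot action.

    Assumption: the route is /slots/{slot_id}?action={action} and
    save/restore take {"filename": ...} relative to --slot-save-path.
    """
    if action not in ("save", "restore", "erase"):
        raise ValueError(f"unsupported slot action: {action}")
    url = f"{base_url}/slots/{slot_id}?action={action}"
    body = {} if filename is None else {"filename": filename}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical usage: persist slot 0's KV cache to disk.
    req = slot_action_request("http://localhost:8080", 0, "save", "chat.bin")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```

Restoring works the same way with action="restore" and the same file name, after which the slot can be reused without re-evaluating the saved prefix.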
Journey Context:
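Most users treat llama-server as stateless and resend the full conversation history with every request, so the system prompt and all prior turns are re-evaluated each time. The server actually keeps persistent slots (loosely comparable to server-side sessions), each holding its own KV cache. Pinning a conversation to one slot (id_slot in recent builds, slot_id in older ones) with prompt caching enabled lets the server reuse the cached prefix and evaluate only the tokens that follow it. The client still sends the full prompt; the saving is in prompt processing, not in request size. The report estimates a 50-90% latency reduction for long contexts, but that figure is not benchmarked here and depends on how much of the prompt is shared with the previous request. The tradeoff is KV-cache memory per slot: usually small next to the model weights, though it grows with context length and slot count.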
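The slot pinning described above can be sketched as follows. This assumes the native /completion endpoint accepts "id_slot" and "cache_prompt" in the request body; the function names, the server address, and the prompts are hypothetical, and older builds may expect "slot_id" instead.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080"  # assumption: local llama-server address


def build_pinned_payload(prompt, slot_id, n_predict=128):
    """Build a /completion body that pins the request to one slot."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "id_slot": slot_id,    # assumed field name; "slot_id" on older builds
        "cache_prompt": True,  # ask the server to reuse the cached prefix
    }


def complete(prompt, slot_id, base_url=SERVER_URL):
    """POST the payload to /completion and return the parsed JSON reply."""
    body = json.dumps(build_pinned_payload(prompt, slot_id)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    system = "You are a terse assistant.\n"
    # Both calls share the same prefix and the same slot, so the second
    # call should only need to evaluate the text after the shared prefix.
    first = complete(system + "User: hello\nAssistant:", slot_id=0)
    second = complete(system + "User: hello\nAssistant: hi\nUser: bye\nAssistant:", slot_id=0)
    print(second.get("content"))
```

Keeping one conversation per slot matters: a request with a different prefix sent to the same slot replaces the cached state and the next turn pays the full prompt cost again.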
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:29:16.879769+00:00— report_created — created