Report #17302
[tooling] Re-processing long system prompts on every API request wastes compute and adds latency
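Use llama-server's slot save/restore API to persist KV cache state between requests. Start the server with `--slot-save-path <dir>` (and `-np N` if you need more than one slot), send the first request with `cache_prompt: true` and a fixed `id_slot`, save the slot with `POST /slots/{id_slot}?action=save`, and reload it for later calls with `POST /slots/{id_slot}?action=restore`. These names follow the llama.cpp server README; older builds may differ (for example `slot_id` instead of `id_slot`).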
Journey Context:
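Most users treat llama-server as stateless and send the full prompt every time, so the KV cache for static prefixes (system prompts, RAG context) is recomputed on each request. The server actually supports two layers of reuse. With `cache_prompt: true`, a slot keeps its KV cache in memory and reuses the common prefix of the next request routed to that same slot (pin it with `id_slot`). With `--slot-save-path` set, a slot's cache can also be written to a file via `POST /slots/{id_slot}?action=save` and loaded back with `action=restore`, which survives the slot being overwritten by other traffic. Either way the static prefix is not re-processed, which can cut latency by roughly 10-50x for long contexts, depending on how much of the prompt is static. The confusion arises because the OpenAI-compatible endpoint has no equivalent of slot save/restore; that goes through the native `/slots` API. People also forget to raise `-np` (`--parallel`) above 1 on startup to get multiple independent cache streams; in current builds `--slots` only toggles the slot monitoring endpoint and does not allocate slots.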
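A minimal sketch of the warm, save, restore sequence, assuming a llama-server started with `--slot-save-path` and listening on `127.0.0.1:8080`. The helper names, `BASE_URL`, and the file name `system_prompt.bin` are hypothetical; the endpoint paths and the `id_slot`, `cache_prompt`, `n_predict`, and `filename` fields follow the llama.cpp server README and should be checked against your build.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: default host/port


def _post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_request(prompt: str, slot_id: int, n_predict: int = 128):
    """Native /completion request pinned to one slot with prefix reuse on."""
    return _post("/completion", {
        "prompt": prompt,
        "id_slot": slot_id,      # route to a fixed slot
        "cache_prompt": True,    # reuse the matching prefix already in the slot
        "n_predict": n_predict,
    })


def slot_action_request(slot_id: int, action: str, filename: str):
    """POST /slots/{id}?action=save|restore; needs --slot-save-path on the server."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    return _post(f"/slots/{slot_id}?action={action}", {"filename": filename})


def send(req: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def warm_save_restore(system_prompt: str, slot_id: int = 0,
                      filename: str = "system_prompt.bin") -> dict:
    """Evaluate the static prefix once, save it, then restore and extend it."""
    # n_predict=0: evaluate the prompt into the cache without generating tokens.
    send(completion_request(system_prompt, slot_id, n_predict=0))
    send(slot_action_request(slot_id, "save", filename))
    # ... later, possibly after the slot has served other prompts:
    send(slot_action_request(slot_id, "restore", filename))
    return send(completion_request(system_prompt + "\nUser: hello\n", slot_id))
```

Calling `warm_save_restore("<long system prompt>")` against a running server performs the whole sequence. If the slot is never reused for other prompts, the save and restore steps can be dropped and `cache_prompt` with a fixed `id_slot` is enough.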
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:56:45.556275+00:00— report_created — created