Report #96497
[tooling] llama.cpp server re-processing entire prompt on every API call instead of reusing KV cache
Set `cache_prompt: true` in the request JSON and keep the slot ID consistent across turns (`slot_id`; recent llama.cpp builds name this field `id_slot`), or use `system_prompt` with proper slot management, so the KV cache stays warm between requests.
Journey Context:
Most clients send the full conversation history on every turn without any slot management. When the server re-processes that whole prompt each time, total token processing grows as O(n^2) over the length of the conversation. llama-server assigns requests to slots (default 1). With `cache_prompt: true` and the same slot on every turn (using the `/infill` or `/completion` endpoints with a fixed `slot_id`), the KV cache persists server-side, so only the part of the prompt that differs from the cached prefix has to be evaluated. This can reduce prompt-processing latency on follow-up turns from seconds to milliseconds. Re-processing the full history on every call, by contrast, wastes memory bandwidth and compute.
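A minimal sketch of the request pattern described above, using only the Python standard library. The server address, the `User:`/`Assistant:` prompt layout, and the helper names (`build_completion_payload`, `complete`, `chat_turn`) are illustrative assumptions, not part of llama.cpp. The slot field is written as `id_slot` here on the assumption of a recent build; older builds may expect `slot_id` instead, so check the server README for your version.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080"  # assumption: where llama-server is listening
SLOT_FIELD = "id_slot"                # assumption: older builds may use "slot_id"


def build_completion_payload(prompt, slot=0, n_predict=128):
    """Build a /completion body that asks the server to reuse its KV cache."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "cache_prompt": True,  # reuse the cached prefix from the previous request
        SLOT_FIELD: slot,      # pin every turn of this conversation to one slot
    }


def complete(prompt, slot=0, base_url=SERVER_URL):
    """POST the prompt to /completion and return the generated text."""
    body = json.dumps(build_completion_payload(prompt, slot)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]


def chat_turn(history, user_msg, slot=0, send=complete):
    """Append one exchange to the history and return the new history.

    The new prompt always starts with the previous history unchanged, so the
    server can match it against the cached prefix and evaluate only the suffix.
    """
    prompt = history + f"User: {user_msg}\nAssistant:"
    reply = send(prompt, slot=slot)
    return prompt + reply + "\n"
```

Each call to `chat_turn` still sends the whole history, but because the history is only ever extended and the slot stays fixed, the server can skip the tokens it has already evaluated. Editing or re-templating earlier turns would change the prefix and defeat the cache from that point on.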
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:33:15.817325+00:00— report_created — created