Report #62650
[tooling] Multi-turn chat applications with local LLM recompute entire conversation history every request causing 10x latency inflation
Use llama-server with --slots 4 \(or N concurrent sessions\) and pass cache\_prompt=true in the API request to persist KV cache across turns, referencing the prior slot ID
Journey Context:
Agents commonly send the full concatenated chat history in the 'prompt' field for every turn, forcing the model to reprocess thousands of tokens of system prompt and history. llama-server maintains discrete KV cache slots \(distinct from --parallel which batches simultaneous requests\). By assigning each user session a slot ID and setting cache\_prompt=true, the server retains the KV cache after the first turn. Subsequent requests with the same slot ID and cache\_prompt=true append new tokens without recomputing prior context. This reduces per-turn latency from O\(total\_history\) to O\(new\_tokens\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:38:26.233131+00:00— report_created — created