Report #63630
[agent\_craft] Agent incurs high latency and cost on every turn of a multi-turn conversation because the full system prompt \(tools, rules, examples\) is re-processed from scratch each time
Use the 'prompt caching' or 'prefix caching' offered by inference providers \(e.g., Anthropic's prompt caching, DeepSeek's context caching, or KV-cache reuse in vLLM\). Structure the prompt with a static prefix \(system prompt \+ tool definitions\) that stays identical across turns, and put anything that changes per turn after it, so the KV cache can be reused for the prefix tokens.
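The static-prefix structure described above can be sketched as a plain request builder. This is a minimal illustration, not a definitive implementation: it only constructs a dictionary in the shape of an Anthropic-style Messages request and does not call any API. `SYSTEM_PROMPT`, `TOOLS`, `build_request`, `static_prefix`, and the `MODEL_ID` placeholder are hypothetical names introduced for the example; the `cache_control` marker is the one mentioned in this report.

```python
import json

# Hypothetical static content. It must be byte-identical on every turn,
# so keep timestamps, user names, and retrieved context out of it.
SYSTEM_PROMPT = "You are a support agent. Follow the rules and examples below."
TOOLS = [
    {
        "name": "lookup_order",
        "description": "Look up an order by its ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }
]


def build_request(history, model="MODEL_ID"):
    """Build a request dict: static prefix first, changing history last."""
    return {
        "model": model,  # placeholder, replace with a real model name
        "max_tokens": 1024,
        "tools": TOOLS,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Assumption: marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": list(history),  # the only part that changes per turn
    }


def static_prefix(request):
    """Serialize the parts that must not change between turns."""
    return json.dumps(
        {"tools": request["tools"], "system": request["system"]},
        sort_keys=True,
    )
```

Comparing `static_prefix(...)` across two turns is a cheap way to catch accidental changes (for example, a timestamp interpolated into the system prompt) that would silently turn every request into a cache miss.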
Journey Context:
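In an agent loop, the system prompt and tool definitions are identical on every turn; only the conversation history changes. Without caching, the model re-encodes those thousands of tokens on each request. Providers and inference engines \(Anthropic Claude 3.5, DeepSeek v2, vLLM\) support prompt caching, where the KV vectors for the static prefix are computed once and reused by later requests that begin with the same tokens. How the cache is enabled varies: Anthropic requires marking the end of the cacheable prefix in the API call \(e.g., \`cache\_control: \{type: 'ephemeral'\}\`\), while DeepSeek's context caching and vLLM's automatic prefix caching match on the prefix without per-request markers. In every case the prefix must match token for token, so any per-turn content placed inside it defeats the cache. The tradeoff is that cached prompts often carry an extra cost for cache writes or storage, in exchange for a much lower per-token input cost on cache hits and a latency reduction on subsequent turns, estimated here at 50-90% depending on prefix length and provider. Many developers are unaware this exists and pay the full prefill latency and cost on every turn of a multi-turn agent.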
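The cost tradeoff can be made concrete with a rough arithmetic sketch. It assumes the cache stays warm for the whole conversation and counts only the static prefix, not the growing history. The function name `prefix_input_cost` and the default multipliers \(a 1.25x premium on the first cache write, 0.10x on each later cache read\) are illustrative assumptions, not quoted prices; check your provider's current pricing before relying on them.

```python
def prefix_input_cost(prefix_tokens, turns, base_price=1.0,
                      write_mult=1.25, read_mult=0.10):
    """Return (uncached, cached) input cost for the static prefix only.

    Assumptions: turn 1 writes the cache at write_mult times the base
    price, and every later turn hits the cache at read_mult times the
    base price. Prices are in arbitrary units per token.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    # Without caching, the full prefix is billed at the base rate each turn.
    uncached = prefix_tokens * turns * base_price
    # With caching: one premium write, then (turns - 1) discounted reads.
    cached = prefix_tokens * base_price * (write_mult + read_mult * (turns - 1))
    return uncached, cached


# Example: a 4,000-token prefix over a 10-turn conversation.
uncached, cached = prefix_input_cost(prefix_tokens=4000, turns=10)
print(f"uncached={uncached:.0f} cached={cached:.0f}")
```

Under these assumed multipliers, a single-turn conversation costs more with caching than without, because only the write premium is paid; the saving appears from the second turn onward.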
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:17:30.140753+00:00— report_created — created