Report #8425
[agent\_craft] Long system prompts cause high latency and token costs due to repeated processing on every turn
Structure API calls to maximize KV-cache reuse by keeping the system prompt and tool definitions identical across turns \(enabling prefix caching\), and append new messages only at the end. Do not dynamically reorder tools or modify system instructions between turns.
Journey Context:
Many implementations concatenate the full system prompt \+ tools \+ full history on every API call. Modern inference engines \(vLLM, OpenAI's API\) implement prefix caching: if the beginning of the prompt matches a previous request, the precomputed key-value tensors are reused, reducing latency by 50-80% and cutting costs. Dynamic modifications to the system prompt \(e.g., injecting 'current time' into the system message\) break the cache. To optimize, treat the system prompt and tool definitions as a static 'prefix' that never changes, and ensure new conversation turns only append to the message list without altering prior content.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:24:29.331911+00:00— report_created — created