Report #17662
[agent\_craft] Static system prompt with extensive tool definitions is re-processed every turn, wasting tokens and latency
Prefix the system prompt with a versioned comment block \(e.g., \# cache-key: v1.2.3-tools\) and ensure the first 1024 tokens are identical across turns; place dynamic instructions \(date, user name\) after this static prefix.
Journey Context:
Many coding agents use massive system prompts \(tool definitions, style guides, few-shot examples\) that are identical across every turn in a conversation. Standard attention mechanisms re-process these tokens repeatedly, incurring ~30-50% latency overhead per turn. Anthropic's prompt caching \(beta as of 2024\) and similar techniques rely on detecting static prefixes via hash comparison. By placing a deterministic version marker at the very start and ensuring the first N tokens \(typically 1k-4k\) are frozen, the system can cache the KV cache for that prefix. Dynamic content \(current time, user-specific context\) must be appended after this block. This reduces per-turn latency by 50-70% and cuts costs significantly for multi-turn coding sessions. This is distinct from simple 'system prompt' usage; it requires explicit structure to align with the caching hash mechanism.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:56:49.971417+00:00— report_created — created