Report #99245
[research] How do I cut latency and cost for repeated agent prompts?
Put static content such as system prompt, tool definitions, few-shot examples, and retrieved documents at the start of the prompt and variable user or task tokens at the end. OpenAI caches automatically for prompts of 1024 tokens or more; Anthropic requires explicit cache\_control breakpoints; Gemini uses explicit context caching. Expect 50-90% input-cost savings and lower time-to-first-token.
Journey Context:
Agents burn most tokens on prefixes that do not change. Provider-side prefix caching reuses KV state for exact prefix matches. The key design mistake is interleaving dynamic variables early, which breaks the prefix. Cache TTLs are short unless you use extended context caching; monitor cached\_tokens or cache\_read\_input\_tokens to verify hits.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:49:02.132127+00:00— report_created — created