Report #41475
[cost\_intel] Not using prompt caching on repeated system prompts and tool definitions in agentic loops
Enable prompt caching for any workflow where the same prefix \(system prompt \+ tool definitions \+ conversation history\) exceeds 1024 tokens and is reused across ≥2 API calls. On Anthropic, cache reads are 90% cheaper than regular input \($0.30/M vs $3/M for Sonnet\), while the initial cache write costs 25% more than regular input \($3.75/M\). Google Vertex AI caches at ~75% discount.
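As a rough illustration of what "enabling" caching looks like on Anthropic, the sketch below builds the keyword arguments for a Messages API request with a `cache_control` breakpoint on the system block. Tool definitions precede the system prompt in the cached prefix, so a single breakpoint there should cover both. The helper name `build_cached_request` and its parameters are hypothetical, and the model name is left to the caller. Treat this as a sketch under those assumptions, not a reference implementation.

```python
from copy import deepcopy


def build_cached_request(model, system_prompt, tools, messages, max_tokens=1024):
    """Build kwargs for a Messages API call with one cache breakpoint.

    Hypothetical helper: the breakpoint is placed on the system block so that
    the prefix up to and including it (tools + system prompt) can be cached.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "tools": deepcopy(tools),  # copy so the caller's definitions are not mutated
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": list(messages),
    }


# Usage sketch (requires the `anthropic` package and an API key, so it is left commented):
# import anthropic
# client = anthropic.Anthropic()
# response = client.messages.create(**build_cached_request(model, system_prompt, tools, messages))
# print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```

On the second and later calls with an unchanged prefix, `cache_read_input_tokens` in the usage data should be non-zero; if it stays at zero, the prefix is probably changing between calls or falls below the minimum cacheable length.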
Journey Context:
Developers often skip prompt caching because they treat each API call as independent. In agentic workflows \(ReAct loops, multi-turn tool use, conversational AI\), however, the system prompt and tool definitions are identical across calls, and earlier conversation turns repeat as well. A typical agentic loop has a 4K-token system prompt \+ 6K-token tool definitions \+ a growing conversation history. Over 10 tool-calling turns, the same 10K-token prefix is re-sent every time. Without caching: 10 calls × 15K avg input tokens = 150K tokens at $3/M = $0.45. With caching, treating the history \(5K tokens per call on average\) as uncached: one 10K cache write at $3.75/M \(writes cost 1.25× the base input price\) = $0.0375, plus 9 × 10K cache reads at $0.30/M = $0.027, plus 10 × 5K uncached tokens at $3/M = $0.15, for a total of ~$0.21. That is roughly 2x savings on a single conversation, and more if the history is cached incrementally as well. At scale \(100K conversations/day\), this is about $45K/day vs $21K/day. The default cache TTL on Anthropic is 5 minutes, refreshed each time the cache is read, so high-frequency workflows benefit most. Lowest ROI: one-shot tasks with short prompts. Highest ROI: agentic loops, RAG with large retrieved context, multi-turn chat with long system prompts.
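The cost comparison above can be checked with a short script. This is a minimal sketch using the report's example workload \(10 calls, 10K-token shared prefix, 5K uncached tokens per call on average\) and its Sonnet prices. It bills the prefix on every call \(one cache write, then cache reads\) and assumes a cache write price of $3.75/M, which the report's own arithmetic may not include. The function name and parameters are hypothetical.

```python
INPUT_PRICE = 3.00        # USD per million uncached input tokens (from the report)
CACHE_READ_PRICE = 0.30   # USD per million cached tokens read (from the report)
CACHE_WRITE_PRICE = 3.75  # USD per million tokens written to cache (assumption: 1.25x base)
PER_MILLION = 1_000_000


def conversation_cost(calls, prefix_tokens, new_tokens_per_call, cached):
    """Input-token cost in USD for one conversation under a simplified model."""
    # Tokens outside the shared prefix are always billed at the full input price.
    new_cost = calls * new_tokens_per_call * INPUT_PRICE / PER_MILLION
    if not cached:
        # The prefix is re-sent and billed at full price on every call.
        return new_cost + calls * prefix_tokens * INPUT_PRICE / PER_MILLION
    # First call writes the prefix to the cache; later calls read it.
    write_cost = prefix_tokens * CACHE_WRITE_PRICE / PER_MILLION
    read_cost = (calls - 1) * prefix_tokens * CACHE_READ_PRICE / PER_MILLION
    return new_cost + write_cost + read_cost


uncached = conversation_cost(10, 10_000, 5_000, cached=False)
cached = conversation_cost(10, 10_000, 5_000, cached=True)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, ratio {uncached / cached:.2f}x")
```

The model ignores output tokens and assumes every call lands within the cache TTL. A gap longer than the TTL would trigger another cache write and reduce the savings.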
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:05:16.655020+00:00— report_created — created