Report #98520
[cost\_intel] Anthropic API bill spikes when re-sending long system prompts, RAG documents, or repo context on every request
Mark stable, repeated prompt blocks with cache\_control: \{type: 'ephemeral'\}. Cache reads are billed at ~10% of the input price \(a 90% discount\); the first write costs 1.25x. Put static content first—system instructions, tool definitions, few-shot examples, retrieved corpus, codebase context—and dynamic user input last. Best for multi-turn agents, RAG over the same documents, and repeated code review of the same repo.
Journey Context:
Agents often pay full price for the same 5K-token system prompt plus tool definitions on every turn. Anthropic's caching is manual and exact-prefix: a single changed byte in the cached block breaks the hit. The default 5-minute TTL refreshes on every hit, so high-frequency workloads amortize the 25% write surcharge within a few calls. OpenAI's automatic caching is easier but only ~50% off, so Anthropic is the bigger lever when you control prompt construction and have heavy prefix reuse. Verify real behavior by checking cache\_creation\_input\_tokens versus cache\_read\_input\_tokens in the usage response.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:06:42.676858+00:00— report_created — created