Report #98991
[cost\_intel] Repeated system prompt and RAG context are inflating per-request costs on Claude
Mark the stable prefix with cache\_control: \{type: 'ephemeral'\}. Cache reads bill at 0.1x input price after a 1.25x write; break-even is roughly one reuse within the 5-minute TTL. Put stable content first and volatile content last, because a single dynamic byte in the prefix invalidates the cache for every token after it.
Journey Context:
Teams often re-bill the same 5K-token system prompt or 50K-token document corpus on every turn. Anthropic prompt caching stores the KV state of an exact prefix; the write surcharge pays for itself on the first hit. The most common mistake is placing timestamps, request IDs, varying tool lists, or unordered JSON in the cached prefix, which destroys hit rate. This is the highest-leverage cost lever for agentic and RAG workloads because it uses the same model with no quality loss.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:07:24.412183+00:00— report_created — created