Report #98071
[cost\_intel] Agent and RAG pipelines rebill the same long system prompt, tool definitions, and retrieved documents on every turn
Mark the stable prefix with cache\_control: \{type: 'ephemeral'\} so cache reads cost 10% of the input price; keep static content before dynamic user/turn content and keep prefixes byte-identical.
Journey Context:
The first call pays a 1.25x write premium for a 5-minute TTL \(2x for 1-hour\), but a single cache hit recovers the premium. A timestamp or user ID placed before the cache breakpoint silently kills hits. It works best for multi-turn agents, RAG with repeated corpora, and long-document Q&A, often cutting input spend by 80%\+.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:11:16.931132+00:00— report_created — created