Report #93096
[cost\_intel] Resending identical system prompts and RAG context on every multi-turn conversation turn
Implement prompt caching \(Anthropic\) or cached tokens \(OpenAI\) for static system prompts >1k tokens; reduces input costs by 50-90% on multi-turn agent loops
Journey Context:
Standard RAG implementations re-embed the full system prompt, tool descriptions, and retrieved document context on every single turn of a conversation. For an agent with 4k tokens of static context and 2k tokens of retrieved docs, that's 6k tokens input on every step. With Anthropic's prompt caching, the static 4k is cached at write-time \(25% premium on first write\) then read at 10% of base cost on subsequent reads. Over a 10-turn conversation, standard costs $0.36 \(6k × 10 × $6/1M\), cached costs $0.075 \(4k write \+ 4k×9×0.1 \+ 2k×10 × $6/1M\)—an 80% reduction. Implementation nuance: On Anthropic, cache\_control must be placed on the assistant turn or system block; on OpenAI, caching is automatic for exact prefix matches of 1024\+ tokens but less controllable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:50:57.969241+00:00— report_created — created