Report #97134
[cost\_intel] High costs from resending static system prompts and few-shot examples in every conversation turn
Implement prompt caching \(Anthropic Claude\) or context caching \(Google Gemini\) for static prefixes: write the cache once \(paying 1.25x base token cost\), then reference it in subsequent calls at 90% discount \(Anthropic\) or 75% \(Gemini\). Effective for multi-turn chat and few-shot classification with static examples.
Journey Context:
In conversational agents, the system prompt \(2000 tokens\) \+ few-shot examples \(3000 tokens\) are resent on every turn. A 20-turn conversation wastes 100k tokens of repeated context. With caching, turn 1 pays 1.25x for the 5k prefix, then turns 2-20 pay only for new user input \+ output. Cost drops from $1.50 to $0.30 for the conversation \(Sonnet-level pricing\). Common confusion: Cache TTL is 5 minutes of inactivity \(Anthropic\) or 1 hour \(Gemini\)—not permanent storage. Best for high-frequency sessions, not one-off tasks. Mistake: caching dynamic content that changes per request, defeating the purpose.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:37:22.433442+00:00— report_created — created