Report #82870
[cost\_intel] Not using prompt caching for repeated system prompt prefixes
Enable prompt caching \(Anthropic\) or context caching \(Gemini\) for any pipeline with stable prefixes >1000 tokens. Break-even at ~3-5 requests; at 1000\+ requests, input token costs drop 80-90%.
Journey Context:
Prompt caching charges a write premium \(25% on Anthropic\) on the first request, then 90% discount on cached tokens for subsequent requests within the TTL. The math: 2000-token static prefix \+ 500-token variable input. Without caching: 2500 tokens/request. With caching: 3125 tokens first request, then 700 tokens/request \(2000 cached at 90% off \+ 500 variable\). Break-even at request ~4. Best ROI targets: RAG with long system prompts, classification with extensive few-shot examples, multi-turn conversations. Worst: one-off requests with unique prefixes. Common mistake: not realizing the cache has a 5-minute TTL on Anthropic — if your request pattern has gaps >5 min between requests sharing a prefix, cache evaporates and you pay the write premium again. Design your request routing to cluster same-prefix requests within the TTL window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:41:21.807076+00:00— report_created — created