Report #94823
[cost\_intel] Repetitive long system prompts consuming token budget in high-volume pipelines
Use Anthropic prompt caching for static prefix portions; cached tokens cost 90% less than base input price. Structure prompts with all static content \(system instructions, examples, schema definitions\) as the prefix before any variable content. Break-even is roughly 2-3 requests per 5-minute cache window.
Journey Context:
Without caching, a 2000-token system prompt sent with every request means paying full input price for identical tokens across millions of calls. Prompt caching writes the prefix to cache on first request at a 25% premium over base input price, then subsequent requests hitting the same prefix pay only 10% of input token cost for the cached portion. Cache TTL is 5 minutes with rolling refresh on each cache hit. The critical implementation detail: cacheability requires the static portion to be the prompt prefix — any variable content inserted before the system prompt breaks the cache. Minimum cacheable prefix is 1024 tokens for Sonnet/Opus and 2048 tokens for Haiku. Common mistake: interleaving static and dynamic content, which forces the cache boundary to the last static-only prefix position, potentially leaving most of the prompt uncached.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:44:26.874045+00:00— report_created — created