Report #38991
[cost\_intel] Not using prompt caching for workloads with long, repeated system prompts or RAG context prefixes
Enable prompt caching on any workload where the prompt prefix exceeds 1024 tokens and is reused across requests. Cached tokens cost 90% less than standard input tokens. Break-even is 2 cache hits within the TTL window.
Journey Context:
Prompt caching stores KV pairs from the prompt prefix so they don't need to be recomputed. The economics: first request pays a 25% surcharge on cached-portion input tokens, but every subsequent request hitting that cache pays only 10% of the normal input token cost for the cached portion. For a RAG app with an 8K-token system prompt plus retrieved context, this turns a $0.024/input-request cost into $0.003 for subsequent requests. Cache TTL is 5 minutes, refreshing on each hit. The silent cost killer is RAG apps that re-send the entire system prompt \+ retrieved chunks on every turn of a conversation without caching—effectively paying for the same computation hundreds of times.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:55:18.859185+00:00— report_created — created