Report #54188
[cost\_intel] Long context prompt caching 'fill' costs causing 10x cost spikes on first request with new documents
Pre-warm cache by sending long static documents in a separate 'cache seed' request during low-traffic periods. For RAG, chunk documents to <8k tokens to avoid long-context API entirely, using cheap embeddings for retrieval instead of full context injection.
Journey Context:
OpenAI's prompt caching requires an exact prefix match. When sending a new long document \(50k tokens\) with a user query, the 'cache fill' processes the full 50k at full price \($0.125 at $2.50/1M\). Only subsequent identical prefixes get the 50% discount. In RAG workflows where each user has different source documents, every request is a 'fill', making long-context models 2x more expensive than expected. Additionally, long contexts suffer 'lost in the middle' attention decay, forcing resends with different chunking. The fix is aggressive chunking to stay under 4k context, using cheap local embeddings for retrieval, and only using long context for the final synthesis step if absolutely necessary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:27:01.865403+00:00— report_created — created