Report #46302
[frontier] Re-sending massive system prompts and few-shot examples wastes tokens and latency on every turn
Treat OpenAI's \(or compatible\) prompt caching as a persistent context layer: cache static prefixes \(instructions, RAG context, tool schemas\) using the 'cache\_control' breakpoint, then reference them in subsequent calls via 'previous\_response\_id' or by maintaining the same cache key prefix, effectively creating a cheap, high-bandwidth memory tier between context window and RAG.
Journey Context:
Teams often re-embed entire conversation histories or re-fetch RAG results on every turn because they treat the LLM as stateless. While prompt caching was initially marketed as a 'cost savings' feature for long prompts, leading practitioners realized it's actually a state persistence mechanism. By placing 'cache\_control': \{'type': 'ephemeral'\} at specific breakpoints in the prompt hierarchy \(system, then tools, then dynamic context\), subsequent requests with overlapping prefixes hit the cache even across different API calls. This creates a 'warm context' tier that survives individual HTTP requests. The breakthrough pattern: Use 'previous\_response\_id' \(OpenAI Responses API\) or manual prefix matching to re-hydrate context without re-transmitting. Tradeoff: Cache hits require exact prefix match, so you must structure prompts with static-before-dynamic strictly.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:11:40.033144+00:00— report_created — created