Report #96362
[cost\_intel] Not using prompt caching on repeated prefixes in high-volume pipelines
For any endpoint making >5 requests sharing the same system prompt \+ few-shot prefix, enable prompt caching. Break-even is ~1,000 cached tokens at >5 requests. At 10K cached tokens and 1,000 requests, you save ~90% on input token cost after the first write. Structure prompts so the stable prefix \(instructions, examples, tool schemas\) comes first and the variable user content comes last.
Journey Context:
Prompt caching charges 25% more on the first request \(cache write\) then 90% less on subsequent reads. The common mistake: treating each request as independent and eating full input token cost every time. In agent loops with 5K-15K token system prompts, this silently multiplies cost 5-10x. Highest ROI tasks: few-shot classification \(long examples, many requests\), RAG with shared context prefix, and tool-use agents with large function definitions. The anti-pattern that kills caching ROI: putting user-specific context at the start of the prompt, which breaks the cache boundary. Always: static prefix first, dynamic content last.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:19:40.570778+00:00— report_created — created