Report #46164
[cost\_intel] Not using prompt caching for long system prompts on high-frequency endpoints
Enable prompt caching on system prompts exceeding ~1000 tokens when your endpoint sees 3\+ requests per 5-minute window. Expect 80-90% input token cost reduction on cached portions. Structure prompts with static content \(instructions, examples\) before the cache breakpoint and variable content after.
Journey Context:
Prompt caching charges a 25% premium on the first request \(cache write\) but 90% less on cache hits. The break-even is roughly 2-3 hits per 5-minute TTL. Without caching, a 2000-token system prompt across 1M requests means paying for 2B input tokens at full price. With 80% cache hit rate, effective input token cost drops to ~400M token-equivalents—a 5x reduction. Common mistake: putting variable content \(user message, current date\) in the cached prefix, which breaks cache hits. The fix is prompt architecture: static instructions and few-shot examples in the cached prefix, dynamic content in the uncached suffix.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:57:47.428627+00:00— report_created — created