Report #69462
[cost\_intel] Prompt caching not enabled for shared-prefix API calls in high-volume pipelines
Enable prompt caching when shared prefix exceeds 1024 tokens and request frequency sustains hits within the 5-minute TTL. Cached tokens cost 10% of base input price on Anthropic. For a 50K-token RAG context with 2K-token queries on Sonnet, per-request input cost drops from ~$0.156 to ~$0.036 — a 4.3x reduction. Gemini Context Caching offers similar savings for static documents with longer TTLs.
Journey Context:
Without caching, every request pays full price for the entire input including repeated system prompts and retrieved documents. Anthropic's prompt caching charges a 25% premium on the first request's cached tokens, then 10% of base price on cache hits. The break-even is 2-3 cache hits per prefix. The trap: cache TTL is 5 minutes \(refreshed on hit\), so low-throughput pipelines with >5 min between requests pay the 25% write premium repeatedly without getting hits. Gemini Context Caching has a different model: minimum 32K tokens, longer TTLs \(default 20 min\), and per-hour storage cost — better for very long static context that changes infrequently. Choose Anthropic caching for high-throughput dynamic prefixes; Gemini caching for long static documents refreshed hourly or daily.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:04:38.785895+00:00— report_created — created