Report #84367
[cost_intel] Ignoring Gemini's context caching for long system prompts on high-volume pipelines
Use Gemini context caching (explicit API) for any pipeline that makes repeated calls with >1K tokens of static context, provided the prefix meets the model's minimum cacheable size. Gemini 1.5 Flash cached input is $0.01875/M tokens vs $0.075/M uncached, a 4x reduction (75% off) on the static prefix.
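A minimal sketch of the explicit flow, assuming the `google-generativeai` Python SDK and its `caching.CachedContent.create` / `GenerativeModel.from_cached_content` calls. The helper names, the `display_name` label, the pinned model version, and the `GOOGLE_API_KEY` environment variable are illustrative assumptions, not part of the original report. The API is expected to reject a prefix below the model's minimum cacheable size, so treat this as a starting point rather than a drop-in implementation.

```python
import datetime
import os


def build_cache_config(static_prefix: str, ttl_minutes: int = 60) -> dict:
    """Collect the arguments for creating a cached context (no network access)."""
    return {
        "model": "models/gemini-1.5-flash-001",   # assumed pinned model version
        "display_name": "static-prefix",          # hypothetical label
        "system_instruction": static_prefix,      # the static content to cache
        "ttl": datetime.timedelta(minutes=ttl_minutes),
    }


def create_cached_model(config: dict):
    """Create the cache once and return a model bound to it."""
    # Imported lazily so build_cache_config works without the SDK installed.
    import google.generativeai as genai
    from google.generativeai import caching

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumed env var
    cache = caching.CachedContent.create(**config)
    return genai.GenerativeModel.from_cached_content(cached_content=cache)
```

Usage would look like `model = create_cached_model(build_cache_config(prefix))`, done once, followed by many `model.generate_content(question)` calls that send only the per-request text.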
Journey Context:
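Gemini's context caching is explicit: you create a cached context through the API, get a cache ID (its resource name), and reference it in later requests. That is a separate step from the generation call, unlike Anthropic's prompt caching, which is enabled inline with cache_control markers on the request itself. Either way, you must opt in. For a RAG pipeline with a 5K-token static prefix (system prompt plus any fixed context) making 1M calls/month with Flash: uncached input cost for the prefix = 5K × $0.075/M × 1M = $375/month. Cached: 5K × $0.01875/M × 1M = $93.75/month, plus the cost of creating the cache and a storage charge billed per token-hour for as long as the cache is alive. Retrieved context that changes per query is not part of the cacheable prefix and is still billed at the uncached rate. The cache TTL is configurable (default 1 hour, extendable). Check the model's minimum cacheable token count before relying on these numbers: the Gemini 1.5 models launched with a 32,768-token minimum, which a 5K-token prefix would not meet. The mistake is either not knowing Gemini has caching (it's less visible than Anthropic's) or not restructuring the prompt to put all static content at the start, where caching applies.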
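The arithmetic above can be reproduced with a short calculation. This is a sketch using the per-token rates quoted in this report; the function names are illustrative, and the storage rate is deliberately left as a parameter to be filled in from the current pricing page.

```python
def monthly_prefix_cost(prefix_tokens: int, calls_per_month: int,
                        usd_per_million: float) -> float:
    """Input cost of resending the same prefix on every call."""
    total_tokens = prefix_tokens * calls_per_month
    return total_tokens / 1_000_000 * usd_per_million


def monthly_storage_cost(prefix_tokens: int, hours_alive: float,
                         usd_per_million_token_hours: float) -> float:
    """Storage charge for keeping the cache alive (rate is an input, not a quote)."""
    return prefix_tokens / 1_000_000 * hours_alive * usd_per_million_token_hours


UNCACHED_RATE = 0.075    # $/M tokens, Gemini 1.5 Flash, as quoted in this report
CACHED_RATE = 0.01875    # $/M tokens, cached input, as quoted in this report

uncached = monthly_prefix_cost(5_000, 1_000_000, UNCACHED_RATE)
cached = monthly_prefix_cost(5_000, 1_000_000, CACHED_RATE)

print(f"uncached: ${uncached:.2f}/month")
print(f"cached:   ${cached:.2f}/month (before storage)")
print(f"saving:   ${uncached - cached:.2f}/month")
```

Caching pays off only while the saving exceeds the result of `monthly_storage_cost` for the hours the cache is kept alive, so low-traffic pipelines with long TTLs should run both numbers.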
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:12:02.424686+00:00— report_created — created