Report #23140
[cost_intel] Paying full input token costs for repeated long-context queries on the same documents in Gemini pipelines
Use Gemini's context caching for tasks that repeatedly query the same long documents: repository analysis, legal document review, research paper Q&A, and any RAG pipeline with a large static knowledge base. The cache creation cost is amortized over a configurable TTL (up to hours), and cached reads are significantly cheaper than full-price input tokens. Rule of thumb: cache contexts over 32K tokens that will be queried 3+ times within the TTL window.
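A minimal sketch of the cache-once, query-many pattern, assuming the `google-genai` Python SDK. The helper names `create_document_cache` and `ask_cached`, the default TTL, and the model placeholder are illustrative and not from the report. The config is passed as plain dicts, so the helpers work with any client object exposing `caches.create` and `models.generate_content`; check the current SDK documentation before relying on the exact field names.

```python
def create_document_cache(client, model, document_text, ttl_seconds=3600):
    """Upload a long, static document once and return the cache handle."""
    return client.caches.create(
        model=model,
        config={
            "contents": [document_text],
            "ttl": f"{ttl_seconds}s",  # duration string, e.g. "3600s"
        },
    )


def ask_cached(client, model, cache, question):
    """Send only the question; the document is referenced via the cache name."""
    response = client.models.generate_content(
        model=model,
        contents=question,
        config={"cached_content": cache.name},
    )
    return response.text


# Usage (assumes `pip install google-genai` and an API key in the environment):
#   from google import genai
#   client = genai.Client()
#   model = "<a caching-capable Gemini model>"  # placeholder, pick your own
#   cache = create_document_cache(client, model, repo_text)
#   print(ask_cached(client, model, cache, "Where is authentication handled?"))
```

The cache must be created for the same model that later reads it, which is why both helpers take `model` explicitly.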
Journey Context:
Gemini's context caching has different economics from Anthropic's prompt caching: the TTL is configurable and can run for hours (Anthropic's default TTL is 5 minutes), cached content incurs an explicit storage cost, and each request saves on the cached tokens it reads. This makes Gemini better suited to very long static contexts (100K+ tokens) queried repeatedly over hours, while Anthropic's caching is better suited to high-frequency bursts of queries within 5 minutes.

For a coding agent analyzing a 200K-token repository: cache the repo context once, then run dozens of queries against it over the next hour. Without caching, each query pays full price for all 200K input tokens; with caching, the 200K tokens are billed at the cheaper cached rate and only the query delta is charged at full price.

Mistakes to avoid: caching short contexts (under 10K tokens), where the storage cost exceeds the read savings, and caching contexts that change frequently, since each change requires recreating the cache at full write cost. Also, Gemini Flash models have lower per-token costs and therefore lower absolute cache savings, so the ROI is highest on Pro/Ultra models with large contexts.
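The amortization argument above can be checked with a small cost model. This is a sketch under stated assumptions: the three prices are hypothetical placeholders, not Gemini's published rates, and the model assumes the cache is written once at the normal input rate, stored for the whole TTL window, and read at a discounted rate on every query. The names `Pricing`, `cost_with_cache`, `cost_without_cache`, and `break_even_queries` are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """All prices are HYPOTHETICAL placeholders, in USD per 1M tokens."""
    input_per_m: float = 1.00          # normal input rate
    cached_read_per_m: float = 0.25    # discounted rate for cached tokens
    storage_per_m_hour: float = 1.00   # storage charge per 1M tokens per hour


def cost_without_cache(context_tokens, query_tokens, n_queries, p):
    # Every query resends the full context plus the question.
    return n_queries * (context_tokens + query_tokens) * p.input_per_m / 1e6


def cost_with_cache(context_tokens, query_tokens, n_queries, ttl_hours, p):
    write = context_tokens * p.input_per_m / 1e6                # one-time creation
    storage = context_tokens * p.storage_per_m_hour * ttl_hours / 1e6
    reads = n_queries * context_tokens * p.cached_read_per_m / 1e6
    deltas = n_queries * query_tokens * p.input_per_m / 1e6     # uncached part
    return write + storage + reads + deltas


def break_even_queries(ttl_hours, p):
    """Smallest query count at which caching is strictly cheaper, or None."""
    saving_per_query = p.input_per_m - p.cached_read_per_m
    if saving_per_query <= 0:
        return None
    overhead = p.input_per_m + p.storage_per_m_hour * ttl_hours
    # Context size cancels out of this ratio in the linear model above.
    return math.floor(overhead / saving_per_query) + 1


if __name__ == "__main__":
    p = Pricing()
    print(break_even_queries(1.0, p))
    print(cost_without_cache(200_000, 500, 30, p))
    print(cost_with_cache(200_000, 500, 30, 1.0, p))
```

With these placeholder prices and a one-hour TTL, the break-even point is the third query. Substituting the real per-model rates shifts the threshold, so rerun the numbers before deciding what to cache.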
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:15:05.303118+00:00— report_created — created