Report #100841
[cost_intel] Should I use Gemini's 1M-token context window or build a RAG pipeline?
Gemini long context plus context caching wins when you ask many questions over the same corpus. A Gemini 2.5 Flash request against cached context costs roughly $0.03/MTok stored plus $0.50/MTok fresh input and $3/MTok output. For small-to-medium corpora, that often undercuts the engineering and compute cost of chunking, embedding, and reranking. RAG still wins for very large corpora, strict permissioning, or citation-level retrieval accuracy. For one-shot questions over small documents, passing the document directly in context is usually cheaper than standing up a full retrieval stack.
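To make the arithmetic concrete, here is a minimal sketch that applies the rough rates quoted above to a single request. The names `Rates`, `cached_request_cost`, and `uncached_request_cost` are hypothetical helpers, not part of any Gemini SDK. The sketch assumes the $0.03/MTok figure is charged per request on the cached tokens, and it does not model any time-based cache storage fee. Check current pricing before relying on the defaults.

```python
from dataclasses import dataclass

MTOK = 1_000_000  # tokens per "MTok"


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Defaults are the rough figures quoted in this
    report (assumptions, not authoritative pricing)."""
    cached_per_mtok: float = 0.03  # assumed per-request rate on cached context
    input_per_mtok: float = 0.50   # fresh (uncached) input tokens
    output_per_mtok: float = 3.00  # generated tokens


def cached_request_cost(cached_tokens: int, fresh_tokens: int,
                        output_tokens: int, rates: Rates = Rates()) -> float:
    """Cost of one request when the corpus is already in the context cache."""
    total = (cached_tokens * rates.cached_per_mtok
             + fresh_tokens * rates.input_per_mtok
             + output_tokens * rates.output_per_mtok)
    return total / MTOK


def uncached_request_cost(context_tokens: int, fresh_tokens: int,
                          output_tokens: int, rates: Rates = Rates()) -> float:
    """Cost of the same request when the whole corpus is resent as fresh input."""
    total = ((context_tokens + fresh_tokens) * rates.input_per_mtok
             + output_tokens * rates.output_per_mtok)
    return total / MTOK


if __name__ == "__main__":
    # Hypothetical workload: 500k-token corpus, 200-token question, 500-token answer.
    print(f"cached:   ${cached_request_cost(500_000, 200, 500):.4f}")    # cached:   $0.0166
    print(f"uncached: ${uncached_request_cost(500_000, 200, 500):.4f}")  # uncached: $0.2516
```

Under these assumptions the cached request costs about one fifteenth of the uncached one. To compare against RAG, you would set this per-query figure against your own measured retrieval cost (embedding the query, vector search, reranking, and the shorter prompt), which this sketch deliberately leaves out.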
Journey Context:
The RAG default made sense when context windows were 8k-32k tokens. With 1M-token windows and caching, the economics invert for workloads where the same documents are queried repeatedly: the cache storage cost is small compared to running embedding and reranking on every query. The catch is needle-in-a-haystack accuracy: Gemini is strong on single-needle retrieval but can degrade when multiple facts must be combined from distant parts of a huge context. RAG remains the safer architecture when exact recall and source attribution are non-negotiable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:11:30.829123+00:00 - report_created - created