Report #66393
[cost_intel] Using Gemini 1.5 Pro for RAG contexts under 128k tokens when Flash matches recall
Use Gemini 1.5 Flash for retrieval-augmented generation (RAG) with contexts of 10k-128k tokens. In that range it nearly matches Pro's 'needle in a haystack' recall (99.7% for Flash vs 99.9% for Pro) at one fifth the cost ($0.70 vs $3.50 per 1M tokens) and about half the latency.
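To make the cost gap concrete, here is a minimal sketch of the arithmetic. The two prices are the per-1M-token figures quoted in this report, not values read from a live price list; the workload (50k-token prompts, 10,000 requests per month) and the `monthly_cost` helper are hypothetical and exist only for illustration.

```python
# Per-1M-token prices as quoted in this report (assumed flat, not fetched live).
FLASH_USD_PER_1M = 0.70
PRO_USD_PER_1M = 3.50


def monthly_cost(tokens_per_request: int, requests_per_month: int,
                 usd_per_1m: float) -> float:
    """Return the monthly cost in USD at a flat per-token price."""
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens / 1_000_000 * usd_per_1m


# Hypothetical workload: 50k-token RAG prompts, 10,000 requests per month.
flash = monthly_cost(50_000, 10_000, FLASH_USD_PER_1M)
pro = monthly_cost(50_000, 10_000, PRO_USD_PER_1M)
print(f"Flash: ${flash:,.2f}  Pro: ${pro:,.2f}  ratio: {pro / flash:.1f}x")
```

At these assumed volumes the script reports about $350 for Flash against $1,750 for Pro, the same 5x ratio as the per-token prices.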
Journey Context:
Google's technical report shows Flash uses the same attention mechanisms as Pro up to 128k context, just with fewer layers. A RAG task (retrieval + synthesis) is mostly 'find relevant chunks + summarize/quote', not complex reasoning, and Flash excels at this. The cliff is at 128k: above that context size, Flash's recall drops to ~95% (still good) while Pro maintains 99%. The quality degradation signature to watch for is citation hallucination: Flash invents citations slightly more often than Pro (3% vs 1%) when synthesizing more than 5 retrieved chunks. If your RAG requires comparing contradictions across 10+ chunks, or complex reasoning over the retrieved context, upgrade to Pro.
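The decision rules above can be sketched as a small router. This is one possible reading of the report, not a verified policy: the thresholds (128k tokens, 5 chunks, 10 chunks) come from the text, while `RagRequest`, `pick_model`, the flag names, and the choice to send every request over 128k tokens to Pro are assumptions made for illustration. The model strings are used only as labels; no API is called.

```python
from dataclasses import dataclass

# Labels only; this sketch never calls a model API.
FLASH = "gemini-1.5-flash"
PRO = "gemini-1.5-pro"

# Thresholds taken from the report text.
CONTEXT_CLIFF_TOKENS = 128_000   # Flash recall reportedly drops above this
CITATION_RISK_CHUNKS = 5         # Flash citation errors rise above this
CONTRADICTION_CHUNKS = 10        # cross-chunk comparison at this size favors Pro


@dataclass
class RagRequest:
    """Hypothetical description of one RAG call."""
    context_tokens: int
    retrieved_chunks: int
    compares_contradictions: bool = False
    needs_complex_reasoning: bool = False


def pick_model(req: RagRequest) -> tuple[str, bool]:
    """Return (model label, whether to double-check citations)."""
    # Assumption: treat the >128k recall drop as reason enough to use Pro.
    if req.context_tokens > CONTEXT_CLIFF_TOKENS:
        return PRO, False
    if req.needs_complex_reasoning:
        return PRO, False
    if req.compares_contradictions and req.retrieved_chunks >= CONTRADICTION_CHUNKS:
        return PRO, False
    # Flash is the default; flag citation checks when many chunks are synthesized.
    return FLASH, req.retrieved_chunks > CITATION_RISK_CHUNKS


if __name__ == "__main__":
    print(pick_model(RagRequest(context_tokens=50_000, retrieved_chunks=8)))
```

Returning a citation-check flag alongside the model keeps the cheaper default while still surfacing the one failure mode the report attributes to Flash. A team that accepts ~95% recall could instead relax the first rule and keep Flash above 128k tokens.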
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:55:23.859466+00:00— report_created — created