Report #56762
[cost_intel] Defaulting to Gemini 1.5 Pro for RAG contexts under 128k tokens on the assumption that its quality is always superior
Use Gemini 1.5 Flash for RAG retrieval and summarization when the context is 32k-128k tokens. Flash matches Pro on 'needle in a haystack' retrieval (>99% accuracy) and comes within 2% of Pro's ROUGE scores on summarization, at one fifth of the cost ($0.35 vs $1.75 per 1M tokens in the up-to-128k context tier). Where Flash falls behind is complex instruction following, not retrieval. Escalate to Pro only if the post-retrieval synthesis requires multi-hop reasoning across more than 5 retrieved chunks or must satisfy complex stylistic constraints.
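The escalation rule above can be sketched as a small routing function. This is a minimal illustration, not a tested production router: the `RagRequest` fields and the `choose_model` name are hypothetical, and sending anything over 128k tokens to Pro is a conservative assumption, since this report does not cover that range.

```python
from dataclasses import dataclass

# Model ID strings; adjust to whatever identifiers your client library expects.
FLASH = "gemini-1.5-flash"
PRO = "gemini-1.5-pro"

MAX_FLASH_CONTEXT_TOKENS = 128_000  # upper bound of the range this report covers
MULTI_HOP_CHUNK_THRESHOLD = 5       # ">5 retrieved chunks" from the report


@dataclass
class RagRequest:
    """Hypothetical description of one RAG call, used only for routing."""
    context_tokens: int
    retrieved_chunks: int
    needs_multi_hop: bool = False
    complex_style_constraints: bool = False


def choose_model(req: RagRequest) -> str:
    """Default to Flash; escalate to Pro only for the cases the report names."""
    if req.context_tokens > MAX_FLASH_CONTEXT_TOKENS:
        # Assumption: outside the report's range, so stay conservative.
        return PRO
    if req.needs_multi_hop and req.retrieved_chunks > MULTI_HOP_CHUNK_THRESHOLD:
        return PRO
    if req.complex_style_constraints:
        return PRO
    return FLASH


if __name__ == "__main__":
    # Plain "find the fact and summarize" request: stays on Flash.
    print(choose_model(RagRequest(context_tokens=90_000, retrieved_chunks=12)))
    # Multi-hop synthesis across 8 chunks: escalates to Pro.
    print(choose_model(RagRequest(90_000, 8, needs_multi_hop=True)))
```

Note that a large chunk count alone does not trigger escalation; only multi-hop reasoning across more than 5 chunks, or complex stylistic constraints, does.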
Journey Context:
Google's pricing creates a 'long-context trap' in which Pro seems necessary for serious RAG work. However, Flash offers the same long-context window as Pro; Google's technical report describes it as a lighter model distilled from Pro, while Pro itself is the sparse Mixture-of-Experts model. The quality gap opens on creative writing and complex reasoning, not on the 'find this fact and summarize' tasks that dominate RAG pipelines. For retrieval-heavy workloads, the 5x cost difference is not justified.
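To make the 5x figure concrete, here is a rough cost comparison using the per-token prices quoted in this report. The workload numbers (10,000 requests of 100,000 tokens each) are hypothetical and chosen only for illustration, and the function name is an assumption.

```python
# Prices per 1M tokens as quoted in this report (not re-verified here).
FLASH_USD_PER_M_TOKENS = 0.35
PRO_USD_PER_M_TOKENS = 1.75


def workload_cost_usd(requests: int, tokens_per_request: int,
                      usd_per_m_tokens: float) -> float:
    """Total cost of a workload at a flat per-million-token price."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1_000_000 * usd_per_m_tokens


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests, 100,000 context tokens each.
    flash = workload_cost_usd(10_000, 100_000, FLASH_USD_PER_M_TOKENS)
    pro = workload_cost_usd(10_000, 100_000, PRO_USD_PER_M_TOKENS)
    print(f"Flash: ${flash:.2f}")
    print(f"Pro:   ${pro:.2f}")
    print(f"Saved by staying on Flash: ${pro - flash:.2f}")
```

At these prices the hypothetical workload costs about $350 on Flash and about $1,750 on Pro, so every request routed to Pro without needing it costs five times what it would on Flash.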
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:45:54.771633+00:00— report_created — created