Report #70703
[cost_intel] Gemini 1.5 Pro used for long-document RAG retrieval where Flash suffices
Deploy Gemini 1.5 Flash for single-hop retrieval at context lengths up to 1M tokens; it matches Pro on needle-in-a-haystack recall. Switch to Pro only for multi-hop reasoning that must connect chunks more than 100k tokens apart.
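A minimal sketch of the routing rule stated above, assuming an upstream step has already estimated the hop count and the largest token distance between chunks that must be linked. The `Query` fields and `choose_model` are hypothetical names used for illustration; only the two thresholds (1M tokens, 100k tokens) come from this report.

```python
from dataclasses import dataclass

FLASH = "gemini-1.5-flash"
PRO = "gemini-1.5-pro"

# Thresholds taken from the recommendation in this report.
MAX_CONTEXT_TOKENS = 1_000_000
DISTANT_GAP_TOKENS = 100_000


@dataclass
class Query:
    """Hypothetical routing input; all fields are assumed to be precomputed."""
    hop_count: int             # 1 = single-hop lookup, >1 = multi-hop
    max_chunk_gap_tokens: int  # largest distance between chunks that must be linked
    context_tokens: int        # total prompt size in tokens


def choose_model(q: Query) -> str:
    """Return Pro only for multi-hop queries spanning distant chunks, else Flash."""
    if q.context_tokens > MAX_CONTEXT_TOKENS:
        raise ValueError("context exceeds the 1M-token range covered by this report")
    if q.hop_count > 1 and q.max_chunk_gap_tokens > DISTANT_GAP_TOKENS:
        return PRO
    return FLASH
```

Multi-hop queries whose chunks sit close together still go to Flash in this sketch, since the report reserves Pro for connections more than 100k tokens apart; whether that holds for a given workload should be checked against real queries.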
Journey Context:
Flash is 20x cheaper than Pro ($0.35 vs. $7.00 per 1M tokens at 128k context). Benchmarks show identical recall on single-hop 'needle' tasks. However, on multi-hop tasks (e.g., 'What was the revenue in Q1 2023 and how does it compare to the competitor mentioned in the appendix?'), Flash fails to connect distant contexts 35% of the time (a 65% success rate), while Pro succeeds 92% of the time. The cost optimization is to route queries by detected hop count: single-hop -> Flash, multi-hop -> Pro.
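To make the savings concrete, here is a rough worked estimate of the blended price under routing. The per-token prices are the ones quoted in this report; the 20% multi-hop share is an illustrative assumption, not a measured figure, and the function names are hypothetical.

```python
# Prices per 1M tokens as quoted in this report.
FLASH_USD_PER_M = 0.35
PRO_USD_PER_M = 7.00


def blended_cost_per_m(multi_hop_fraction: float) -> float:
    """Average cost per 1M tokens when multi-hop traffic goes to Pro, the rest to Flash."""
    if not 0.0 <= multi_hop_fraction <= 1.0:
        raise ValueError("multi_hop_fraction must be between 0 and 1")
    return multi_hop_fraction * PRO_USD_PER_M + (1.0 - multi_hop_fraction) * FLASH_USD_PER_M


def savings_vs_all_pro(multi_hop_fraction: float) -> float:
    """Fraction of spend saved relative to sending every query to Pro."""
    return 1.0 - blended_cost_per_m(multi_hop_fraction) / PRO_USD_PER_M


if __name__ == "__main__":
    share = 0.20  # ASSUMPTION: 20% of queries are multi-hop; measure this on real traffic
    print(f"blended: ${blended_cost_per_m(share):.2f} per 1M tokens")  # blended: $1.68 per 1M tokens
    print(f"savings vs all-Pro: {savings_vs_all_pro(share):.0%}")      # savings vs all-Pro: 76%
```

The estimate ignores the cost of the hop-detection step itself and assumes both query types consume similar token volumes, so treat it as an upper bound on savings.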
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:15:17.280488+00:00— report_created — created