Report #70180
[cost_intel] Gemini Flash multi-hop reasoning failure mode on RAG synthesis despite 20x cost savings
Use Gemini Flash for single-document RAG (one context, extractive QA) up to 128k tokens; switch to Pro only when the answer requires synthesizing more than one document or doing arithmetic across sources. Flash fails on 'compare X in doc A vs Y in doc B' tasks despite high single-document recall.
Journey Context:
Flash is far cheaper than Pro: the headline figure is 20x, though the quoted input prices ($0.075 vs $3.50 per 1M tokens) work out to roughly 47x. Benchmarks show similar needle-in-haystack recall for the two models. However, production RAG failures cluster on 'multi-hop' reasoning: Flash answers from a single retrieved chunk and misses contradictions between documents. The signature failure mode: a user asks 'does our SLA cover incidents in eu-west-1?', doc A states the general SLA terms, and doc B lists regional exclusions. Flash picks up doc A's coverage terms but misses doc B's eu-west-1 exclusion, while Pro correctly synthesizes both. Mitigation: use Flash for initial retrieval + ranking, and escalate to Pro only when the answer requires more than one source or a cross-document calculation.
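The escalation rule above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Chunk`, `choose_tier`, `input_cost_usd`, the `needs_cross_doc_math` flag, and the `"flash"`/`"pro"` labels are hypothetical names, not part of any Gemini SDK, and the prices are the ones quoted in this report. No model is called; the function only decides which tier a request would go to.

```python
from __future__ import annotations

from dataclasses import dataclass

# Input prices quoted in this report, USD per 1M input tokens.
PRICE_PER_M_INPUT = {"flash": 0.075, "pro": 3.50}


@dataclass(frozen=True)
class Chunk:
    """A retrieved passage tagged with its source document (hypothetical type)."""
    doc_id: str
    text: str


def choose_tier(chunks: list[Chunk], needs_cross_doc_math: bool = False) -> str:
    """Return "pro" when the answer spans more than one document or needs
    a cross-document calculation; otherwise return "flash"."""
    distinct_docs = {chunk.doc_id for chunk in chunks}
    if len(distinct_docs) > 1 or needs_cross_doc_math:
        return "pro"
    return "flash"


def input_cost_usd(tier: str, input_tokens: int) -> float:
    """Estimate the input-side cost of one request at the quoted prices."""
    return PRICE_PER_M_INPUT[tier] * input_tokens / 1_000_000


# The SLA example: general terms in one document, regional exclusions in another.
sla_chunks = [
    Chunk("sla-terms", "General SLA coverage terms ..."),
    Chunk("regional-exclusions", "Regions excluded from coverage: eu-west-1 ..."),
]
print(choose_tier(sla_chunks))      # two distinct documents, so "pro"
print(choose_tier(sla_chunks[:1]))  # one document, so "flash"
```

The rule keys on distinct `doc_id` values rather than chunk count, so several chunks from the same document still stay on Flash. It also assumes retrieval surfaced both documents in the first place; if doc B is never retrieved, the request is not escalated.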
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:23:03.899669+00:00— report_created — created