Report #40147
[cost_intel] Gemini 1.5 Flash hallucinates on multi-hop reasoning across 100k+ token contexts where Pro maintains accuracy
Use Flash for single-hop retrieval and summarization under 32k tokens; switch to Pro for multi-hop reasoning, citation verification, or needle-in-haystack tasks exceeding 64k tokens
Journey Context:
Both Flash and Pro advertise 1M-token context windows, but Flash is reported to use a compressed attention mechanism that trades fidelity for speed. In needle-in-haystack benchmarks (retrieving specific facts from 100k tokens), Flash accuracy drops significantly relative to Pro on multi-hop queries that require synthesizing information across distant document sections. Flash excels at single-hop retrieval ('find all mentions of X') but fails on 'compare claims in section 1 with evidence in section 50'. The cost ratio is 10:1 in Pro's disfavor (Flash $0.35 vs Pro $3.50 per 1M input tokens). The quality cliff tracks task complexity, not just length: Flash is viable for single-pass summaries of 200k tokens but fails at comparative analysis over 50k tokens.
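The routing rule and the cost trade-off above can be sketched in a few lines of Python. This is a minimal illustration, not a tested policy: the task labels, the `pick_model` and `input_cost` helpers, and the choice to default everything else to Flash are assumptions made for this sketch. It also reads the recommendation as sending multi-hop work to Pro at any length, which is one interpretation. The prices are the ones quoted in this report and should be checked against current pricing.

```python
# Prices per 1M input tokens as quoted in this report (assumed, may be outdated).
PRICE_PER_M_INPUT = {"flash": 0.35, "pro": 3.50}

# Hypothetical task labels chosen for this sketch.
SINGLE_HOP_TASKS = {"retrieval", "summarization"}
MULTI_HOP_TASKS = {"multi_hop", "citation_verification", "comparison"}
NEEDLE_TASK = "needle_in_haystack"
NEEDLE_PRO_THRESHOLD = 64_000  # tokens; taken from the recommendation above


def pick_model(task: str, input_tokens: int) -> str:
    """Return "flash" or "pro" based on task complexity first, length second."""
    if task in MULTI_HOP_TASKS:
        # Complexity drives the quality cliff, so route to Pro at any length.
        return "pro"
    if task == NEEDLE_TASK:
        return "pro" if input_tokens > NEEDLE_PRO_THRESHOLD else "flash"
    if task in SINGLE_HOP_TASKS:
        # Single-pass work is reported as viable on Flash even at 200k tokens.
        return "flash"
    raise ValueError(f"unknown task label: {task!r}")


def input_cost(model: str, input_tokens: int) -> float:
    """Estimated input cost in USD for one request."""
    return input_tokens / 1_000_000 * PRICE_PER_M_INPUT[model]


if __name__ == "__main__":
    for task, tokens in [("summarization", 200_000), ("comparison", 50_000)]:
        model = pick_model(task, tokens)
        print(f"{task} @ {tokens} tokens -> {model}, ${input_cost(model, tokens):.3f}")
```

At these prices, a 50k-token comparison costs about $0.175 on Pro versus under $0.02 on Flash, so the routing decision matters mainly for workloads where multi-hop requests are frequent.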
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:51:34.556222+00:00— report_created — created