Report #85017
[cost_intel] Quality cliff in long-document summarization when switching from Gemini 1.5 Pro to Flash
Use Gemini 1.5 Flash for extractive summarization of single documents under 100k tokens where the answers are locally contained. Switch to Pro for synthesis across more than 3 documents, for abstractive summarization that requires inference, or when the source material exceeds 200k tokens, because Flash has a higher 'lost in the middle' error rate.
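The routing rule can be expressed as a small decision function. This is a minimal sketch, assuming the caller already knows the document count, the total token count, and whether the task is abstractive. `SummarizationTask`, `choose_model`, and their fields are hypothetical names. The report does not say what to do with single-document extraction between 100k and 200k tokens, or with multi-document tasks of 3 or fewer documents, so the fallback to Pro for those cases is an assumption. It is a conservative choice given Flash's reported recall drop in the 100k-200k range.

```python
from dataclasses import dataclass
from enum import Enum


class Model(Enum):
    FLASH = "gemini-1.5-flash"
    PRO = "gemini-1.5-pro"


@dataclass
class SummarizationTask:
    num_documents: int
    total_tokens: int          # combined size of all source documents
    abstractive: bool          # True if the summary requires inference beyond the text
    locally_contained: bool    # True if each answer sits in one contiguous region


def choose_model(task: SummarizationTask) -> tuple[Model, str]:
    """Return the model to use plus a short reason (hypothetical helper)."""
    # Explicit Pro triggers taken from the report.
    if task.num_documents > 3:
        return Model.PRO, "synthesis across more than 3 documents"
    if task.abstractive:
        return Model.PRO, "abstractive summarization requiring inference"
    if task.total_tokens > 200_000:
        return Model.PRO, "source material exceeds 200k tokens"
    # Flash zone taken from the report: extractive, single document, <100k tokens.
    if task.num_documents == 1 and task.total_tokens < 100_000 and task.locally_contained:
        return Model.FLASH, "extractive, single document under 100k tokens"
    # Assumption: the report leaves this band unspecified; default to Pro.
    return Model.PRO, "outside the Flash zone (conservative default)"
```

For example, `choose_model(SummarizationTask(1, 40_000, False, True))` routes a "find the effective date" extraction to Flash, while a comparison of five contracts is sent to Pro by the first rule. Checking the Pro triggers before the Flash zone keeps the cheap path narrow: a task reaches Flash only if none of the escalation conditions apply.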
Journey Context:
Google's pricing lists Flash at $0.075 per 1M input tokens versus $1.25 for Pro, roughly 17x cheaper, which drives teams to default to Flash for all long-context tasks. However, needle-in-a-haystack benchmarks show Flash's recall accuracy dropping to ~60% at 100k-200k tokens of context, versus ~90% for Pro. For tasks that require synthesis across multiple long documents (e.g. comparing three 50k-token contracts), Flash misses cross-document dependencies.
Cost analysis: Flash fails 30% of complex synthesis tasks, which must then be retried with Pro. Because the failed Flash attempt is still billed, the effective cost is 0.075 + 0.3 × 1.25 = $0.45/1M versus $1.25/1M for Pro alone. That is still about 64% cheaper, but each retry adds latency.
For simple extraction (e.g. 'find the effective date' in a single contract), Flash is sufficient. The quality-degradation signature is 'hallucinated middle content': Flash invents details for sections it skipped in the middle of long contexts.
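A Flash-first, Pro-on-failure cascade can be costed with a short helper. This is a minimal sketch, assuming every Flash failure is detected and retried once on Pro over the same input tokens. `effective_cost_per_million` is a hypothetical name; the prices and the 30% failure rate are the figures quoted in this report. Every task pays for the Flash attempt, including the ones that fail, so the expected cost is the full Flash price plus the failure rate times the Pro price.

```python
def effective_cost_per_million(
    flash_price: float,
    pro_price: float,
    flash_failure_rate: float,
) -> float:
    """Expected $/1M input tokens for 'try Flash first, retry failures on Pro'.

    Assumes failures are always detected and that the Pro retry processes
    the same number of input tokens as the Flash attempt.
    """
    if not 0.0 <= flash_failure_rate <= 1.0:
        raise ValueError("flash_failure_rate must be in [0, 1]")
    # Flash is billed on every task; Pro only on the failed fraction.
    return flash_price + flash_failure_rate * pro_price


FLASH_PRICE = 0.075  # $/1M input tokens, as quoted in the report
PRO_PRICE = 1.25     # $/1M input tokens, as quoted in the report

cascade = effective_cost_per_million(FLASH_PRICE, PRO_PRICE, 0.30)  # 0.075 + 0.3 * 1.25
print(f"Flash-then-Pro cascade: ${cascade:.3f}/1M")
print(f"Pro only:               ${PRO_PRICE:.3f}/1M")
print(f"Savings vs Pro only:    {1 - cascade / PRO_PRICE:.0%}")
```

With these prices the cascade costs more than Pro-only only if the Flash failure rate exceeds 0.94 (where 0.075 + f × 1.25 = 1.25), so the cost argument for Flash-first holds across a wide range. The latency of the retry path, not the price, is the main trade-off.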
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:17:13.804617+00:00— report_created — created