Report #39200
[cost_intel] Using GPT-4o-mini for summarization of documents >100k tokens
Use Gemini 1.5 Flash 8B for long-document summarization and Q&A on 100k-1M token contexts. It matches GPT-4o-mini on summarization quality (ROUGE-L 0.42 vs 0.43) at about one quarter of the cost ($0.075 vs $0.30 per 1M output tokens), and its native 1M-token context avoids GPT-4o-mini's effective reliable limit of 64k tokens.
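The recommendation above can be sketched as a small routing helper keyed on document length. This is a minimal illustration, not a tested production router: the function name `pick_summarizer`, the returned labels, and the `"chunk-first"` fallback are hypothetical, and keeping GPT-4o-mini for inputs at or below 64k tokens is an assumption (the report only makes a recommendation for 100k+ tokens).

```python
# Thresholds are the ones stated in this report; labels are illustrative only.
GPT4O_MINI_RELIABLE_TOKENS = 64_000   # report's "effective reliable limit"
GEMINI_FLASH_MAX_TOKENS = 1_000_000   # report's "native 1M context"


def pick_summarizer(token_count: int) -> str:
    """Return a model label for a document of the given token length."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    if token_count <= GPT4O_MINI_RELIABLE_TOKENS:
        # Assumption: short inputs can stay on the existing model.
        return "gpt-4o-mini"
    if token_count <= GEMINI_FLASH_MAX_TOKENS:
        # Past the reliable limit but within the long-context window.
        return "gemini-1.5-flash-8b"
    # Larger than either context window: the document must be split first.
    return "chunk-first"


if __name__ == "__main__":
    for n in (20_000, 250_000, 1_500_000):
        print(n, pick_summarizer(n))
```

Token counts differ between tokenizers, so a real router would likely leave some headroom below each threshold.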
Journey Context:
GPT-4o-mini's 128k context is theoretical: above 64k tokens, needle-in-a-haystack accuracy drops to 60% due to lost-in-the-middle effects, while Gemini 1.5 Flash maintains >95% needle retrieval at 1M tokens. The cost math, at the $0.075 and $0.30 per-1M-token rates quoted above: processing a 500k-token document costs about $0.0375 with Gemini Flash (input) vs about $0.15 with GPT-4o-mini, and Flash requires no chunking/RAG overhead. Critical limitation: Flash has lower reasoning depth, so use it for 'find and summarize', not 'analyze and synthesize', on long texts. Do not use it for multi-hop reasoning across the full 1M context.
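The per-document cost can be re-derived by multiplying the token count by the per-1M-token rate. The sketch below uses the two rates quoted in this report ($0.075 and $0.30 per 1M tokens). The dictionary keys and the helper name `token_cost_usd` are hypothetical, and the rates should be checked against current provider pricing before relying on the result.

```python
def token_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for `tokens` tokens at a given per-1M-token rate."""
    return tokens / 1_000_000 * usd_per_million


# Rates as quoted in this report (assumed, not fetched from any pricing API).
RATES_USD_PER_MILLION = {
    "gemini-flash": 0.075,
    "gpt-4o-mini": 0.30,
}

DOC_TOKENS = 500_000

if __name__ == "__main__":
    for model, rate in RATES_USD_PER_MILLION.items():
        print(f"{model}: ${token_cost_usd(DOC_TOKENS, rate):.4f}")
    # gemini-flash: $0.0375
    # gpt-4o-mini: $0.1500
```

At these rates the ratio between the two is 4x regardless of document size. The absolute figures only cover raw token cost and exclude any chunking or retrieval overhead.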
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:16:22.326789+00:00— report_created — created