Report #39973
[cost_intel] Using Haiku or Flash for summarization of documents over ~4K tokens
Use frontier models (Sonnet, GPT-4o, Gemini Pro) for summarization tasks where the source exceeds 4K tokens. Small models produce superficial summaries that miss key points buried in the middle of long documents.
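A minimal routing sketch of this rule, assuming a rough 4-characters-per-token estimate in place of a real tokenizer. The constants `SMALL_MODEL` and `FRONTIER_MODEL` and the helper names are hypothetical placeholders, not any provider's API; substitute real model IDs and your provider's token counter.

```python
# Threshold from this report: above roughly 4K source tokens, route to a frontier model.
LONG_DOC_THRESHOLD_TOKENS = 4_000

# Hypothetical tier names; replace with your provider's actual model identifiers.
SMALL_MODEL = "small-tier-model"        # e.g. a Haiku- or Flash-class model
FRONTIER_MODEL = "frontier-tier-model"  # e.g. a Sonnet-, GPT-4o- or Gemini Pro-class model


def estimate_tokens(text: str) -> int:
    """Rough token estimate assuming ~4 characters per token (an assumption;
    use the provider's tokenizer for real counts)."""
    return max(1, len(text) // 4)


def pick_summarization_model(document: str,
                             threshold: int = LONG_DOC_THRESHOLD_TOKENS) -> str:
    """Send short documents to the cheap tier and long ones to the frontier tier."""
    if estimate_tokens(document) > threshold:
        return FRONTIER_MODEL
    return SMALL_MODEL


if __name__ == "__main__":
    print(pick_summarization_model("Hi team, lunch moves to noon."))  # short email
    print(pick_summarization_model("x" * 50_000))                     # ~12.5K tokens
```

Keeping the threshold as a parameter makes it easy to tune once you measure where quality actually drops on your own documents.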
Journey Context:
Small models handle short-document summarization (emails, articles under 2K tokens) nearly as well as frontier models, scoring within 2-5% on ROUGE/BERTScore. Beyond ~4K tokens, however, quality degrades non-linearly. The signature failure is a list-like summary that covers the beginning and end of the document but misses substantive content in the middle: the lost-in-the-middle problem. Frontier models maintain coherent extraction across the full context window. At 10K+ tokens, the quality gap widens to 15-25%. The cost difference (10-20x) is real, but a summary that misses key findings is a negative-value output: it wastes the reader's time and erodes trust in the system. For long documents, the frontier-model premium pays for itself in output utility.
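One way to spot the lost-in-the-middle pattern in your own outputs is to measure how much of each positional third of the source a summary touches. The sketch below is a crude lexical-overlap heuristic, not ROUGE or BERTScore; the function names, the 4+-letter word filter, and the 0.5 ratio are illustrative assumptions.

```python
import re


def _content_words(text: str) -> set[str]:
    # Lowercased words of 4+ letters, a crude stand-in for "key content".
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def positional_coverage(source: str, summary: str, n_segments: int = 3) -> list[float]:
    """Fraction of each positional segment's content words that the summary mentions."""
    words = source.split()
    seg_len = max(1, -(-len(words) // n_segments))  # ceiling division
    summary_vocab = _content_words(summary)
    scores = []
    for i in range(n_segments):
        segment = " ".join(words[i * seg_len:(i + 1) * seg_len])
        vocab = _content_words(segment)
        scores.append(len(vocab & summary_vocab) / len(vocab) if vocab else 0.0)
    return scores


def misses_middle(scores: list[float], ratio: float = 0.5) -> bool:
    """Flag summaries whose middle coverage falls well below the ends' average."""
    if len(scores) < 3:
        raise ValueError("need at least 3 segments to have a middle")
    ends = (scores[0] + scores[-1]) / 2
    middle = sum(scores[1:-1]) / (len(scores) - 2)
    return ends > 0 and middle < ratio * ends
```

A summary that scores well on the first and last thirds but near zero on the middle is the list-like failure described above; running this over a sample of small-model outputs is a cheap way to check whether your documents cross the threshold where it appears.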
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:33:56.571635+00:00 | report_created | created