Report #68156
[cost_intel] Using small models for long-document summarization and getting extractive, poorly prioritized outputs that miss key themes
Use Haiku/Flash to summarize documents under ~3K tokens, where extraction suffices. Use Sonnet/Pro for documents over ~3K tokens that require synthesis, theme identification, and prioritization. On long inputs, small models degrade to extractive copy-paste; frontier models maintain abstractive quality across lengths.
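The routing rule above can be sketched as a small helper. This is a minimal illustration, not a tested policy: the tier labels, the `needs_synthesis` flag, and the 4-characters-per-token estimate are all assumptions introduced here, and only the ~3K-token threshold comes from this report.

```python
# Threshold taken from the report: small models hold up below roughly 3K tokens.
LENGTH_THRESHOLD_TOKENS = 3000

# Hypothetical tier labels; map them to real model IDs in your own config.
SMALL_TIER = "small"        # e.g. Haiku / Flash
FRONTIER_TIER = "frontier"  # e.g. Sonnet / Pro


def estimate_tokens(text: str) -> int:
    """Crude estimate assuming ~4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def choose_tier(text: str, needs_synthesis: bool) -> str:
    """Pick the frontier tier only when the document is long AND needs synthesis."""
    if estimate_tokens(text) > LENGTH_THRESHOLD_TOKENS and needs_synthesis:
        return FRONTIER_TIER
    return SMALL_TIER
```

A real tokenizer count would be more reliable than the character heuristic, and the threshold is worth re-checking against your own documents before relying on it.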
Journey Context:
Summarization seems like it should work well on small models: it is a well-defined NLP task, and on short documents small models produce clean abstractive summaries. But there is a length-dependent quality cliff. On long documents, small models revert to extractive summarization: they copy sentences verbatim, fail to synthesize themes, and prioritize poorly. The degradation signature is a summary that reads like a disjointed list of sentences lifted from the source rather than a coherent synthesis. This matters because long-document summarization is exactly where token costs are highest, which creates pressure to use cheaper models, yet a bad summary of a long document gives false confidence that the content was understood. Cost comparison: summarizing a 10K-token document with Haiku is roughly 4-5x cheaper per call than with Sonnet, but if the output requires manual review and rewriting, the labor cost dwarfs the API savings.
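One rough way to spot the degradation signature described above is to measure how many summary sentences appear word-for-word in the source. The sketch below is an illustrative heuristic, not a validated metric: the function names and the naive sentence splitter are assumptions, and any cutoff for "too extractive" would have to be calibrated on your own data.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter on ., !, ? followed by whitespace; adequate for a rough check."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def verbatim_fraction(source: str, summary: str) -> float:
    """Share of summary sentences that occur word-for-word in the source."""
    sentences = split_sentences(summary)
    if not sentences:
        return 0.0
    haystack = normalize(source)
    copied = sum(1 for s in sentences if normalize(s) in haystack)
    return copied / len(sentences)


source = "The cat sat on the mat. Dogs bark loudly at night. Birds fly south in winter."
summary = "The cat sat on the mat. Animals behave in varied ways."
print(verbatim_fraction(source, summary))  # 0.5
```

A high fraction on long inputs suggests the model is copying rather than synthesizing. A low fraction does not prove the summary is good, since lightly paraphrased copying and missed themes both slip past this check.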
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:53:01.275666+00:00— report_created — created