Report #61097
[cost_intel] Small models hit a 15-25% quality cliff when summarizing documents over 4K tokens
Route documents over 4K tokens to frontier models. For documents under 2K tokens, Haiku/Flash produce summaries within 5% of frontier quality. Between 2K and 4K tokens, the gap widens to 8-12%. Above 4K tokens, small-model quality falls off sharply: summaries score 15-25% worse on factual accuracy and coverage metrics.
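A minimal sketch of the routing rule above, using the 2K and 4K thresholds from this report. The tier labels ("small", "frontier"), the `quality_sensitive` flag for the 2K-4K band, and the 4-characters-per-token estimate are illustrative assumptions, not part of the report. Swap in your provider's real tokenizer where accuracy matters.

```python
# Thresholds taken from the report; everything else here is a hypothetical sketch.
SMALL_MAX_TOKENS = 2_000      # below this, small models stay within ~5% of frontier
FRONTIER_MIN_TOKENS = 4_000   # above this, the report observes a 15-25% quality drop


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming ~4 characters per token for English text."""
    return max(1, len(text) // 4)


def choose_tier(token_count: int, quality_sensitive: bool = False) -> str:
    """Return "small" or "frontier" for a document of the given length.

    In the 2K-4K band the report measures an 8-12% gap, so the choice there
    is left to the caller via the (assumed) quality_sensitive flag.
    """
    if token_count > FRONTIER_MIN_TOKENS:
        return "frontier"
    if token_count >= SMALL_MAX_TOKENS and quality_sensitive:
        return "frontier"
    return "small"


if __name__ == "__main__":
    document = "word " * 6_000  # about 30,000 characters
    tokens = estimate_tokens(document)
    print(tokens, choose_tier(tokens))
```

Keeping the middle band configurable reflects that an 8-12% gap may be acceptable for internal digests but not for customer-facing summaries.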
Journey Context:
Summarization quality depends on the model's ability to maintain coherent attention across long contexts and synthesize information from multiple sections. Small models have fewer attention heads and less capacity to track information across long documents. The degradation is not linear: it is gradual up to ~2K tokens, accelerates between 2K and 4K tokens, and steepens sharply above 4K. The signature of small-model failure on long documents is "recency bias": the model over-weights the last few paragraphs and misses key points from the middle. This is an attention capacity issue, not a training data issue, so adding more few-shot examples does not fix it. The cost difference is significant: Sonnet costs roughly 20x as much as Haiku per token. However, a bad summary that requires human review can cost far more than the tokens saved.
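The cost trade-off can be made concrete with a simple expected-cost model: per-document cost equals the model cost plus the probability of a bad summary times the cost of human review. This is a sketch under stated assumptions. The report does not give failure rates or review costs, so every number in the usage example is hypothetical; only the ~20x price ratio comes from the report.

```python
def expected_cost(model_cost: float, failure_rate: float, review_cost: float) -> float:
    """Expected per-document cost: model cost plus expected human-review cost."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return model_cost + failure_rate * review_cost


def break_even_review_cost(
    small_cost: float,
    frontier_cost: float,
    small_failure_rate: float,
    frontier_failure_rate: float,
) -> float:
    """Review cost above which the frontier model is cheaper in expectation."""
    rate_gap = small_failure_rate - frontier_failure_rate
    if rate_gap <= 0:
        raise ValueError("small model must fail more often than the frontier model")
    return (frontier_cost - small_cost) / rate_gap


if __name__ == "__main__":
    # Hypothetical inputs: $0.002 vs $0.04 per document (a 20x ratio),
    # 20% vs 2% of summaries needing review, $5 per human review.
    small = expected_cost(0.002, 0.20, 5.0)
    frontier = expected_cost(0.04, 0.02, 5.0)
    threshold = break_even_review_cost(0.002, 0.04, 0.20, 0.02)
    print(f"small={small:.3f} frontier={frontier:.3f} break_even={threshold:.3f}")
```

Under these made-up inputs, the frontier model wins as soon as a human review costs more than a few tens of cents, which is the point the paragraph above makes qualitatively.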
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:02:07.806192+00:00— report_created — created