Report #35301
[cost_intel] Summarization quality gap between model tiers grows with summary length and synthesis demands
Use small models for short abstractive summaries (under 100 words), where they come within 5% of frontier quality. For synthesis-heavy summaries over 300 words that require cross-document reasoning, frontier models are 25-40% better, and small models degrade to extractive copying.
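The recommendation above can be read as a simple routing rule. The sketch below is illustrative only: the function name `choose_tier` and the tier labels are hypothetical, and the `"evaluate"` result for cases the report does not cover (for example, 100-300 words) is an assumption, not a finding from the report.

```python
def choose_tier(target_words: int, cross_document: bool) -> str:
    """Pick a model tier from the target summary length and task type.

    Thresholds (100 and 300 words) come from the report; everything
    between them is not covered there, so this sketch returns "evaluate"
    to signal that the caller should test both tiers (an assumption).
    """
    # Short, single-source summaries: small models are reported to be
    # within 5% of frontier quality.
    if target_words < 100 and not cross_document:
        return "small"
    # Long, synthesis-heavy summaries: frontier models are reported to be
    # 25-40% better.
    if target_words > 300 and cross_document:
        return "frontier"
    # Not covered by the report: measure before committing to a tier.
    return "evaluate"
```

For example, `choose_tier(80, False)` returns `"small"` and `choose_tier(400, True)` returns `"frontier"`.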
Journey Context:
Short summarization is a compression task: identify the key point and state it concisely. Small models handle this nearly as well as frontier models because the output space is constrained and the task is largely pattern matching against the input. Long-form summarization that requires synthesizing information across multiple sections or documents exposes a real capability gap. The degradation signature on small models is a shift from abstractive to extractive output: the model starts copying long passages verbatim instead of synthesizing, and it loses the ability to resolve contradictions between sources. This shift is detectable by measuring the share of n-grams in the summary that do not appear in the source (novel n-grams). Frontier models maintain 40-60% novel n-grams even in long summaries, while small models drop below 20% once summaries grow past 300 words.
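One way to compute the novel n-gram ratio described above is sketched here. The report does not say which n-gram size or tokenizer was used, so the bigram default (`n=2`), the regex-based lowercase tokenizer, and the function names are all assumptions; only the 20% cutoff comes from the report.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase word tokens; the report does not specify a tokenizer.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # All contiguous n-token windows, in order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams that never occur in the source."""
    source_ngrams = set(ngrams(tokenize(source), n))
    summary_ngrams = ngrams(tokenize(summary), n)
    if not summary_ngrams:
        return 0.0  # summary too short to contain any n-gram
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)


def looks_extractive(source: str, summary: str, threshold: float = 0.20) -> bool:
    # 0.20 mirrors the "below 20%" figure in the report.
    return novel_ngram_ratio(source, summary) < threshold
```

For multi-document summaries, concatenating the sources into a single `source` string before calling `novel_ngram_ratio` is a reasonable starting point. A summary flagged by `looks_extractive` is a candidate for rerouting to a larger model.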
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:43:51.466757+00:00— report_created — created