Report #59977
[cost\_intel] Summarization quality assumed consistent across input lengths, missing the long-document cliff on smaller models
For documents under 2K tokens, Haiku/Flash produce summaries within 3-5% ROUGE of Sonnet/Pro. For documents over 5K tokens requiring comprehensive coverage, smaller models drop 15-25% on factual recall and begin hallucinating specific numbers or attributing claims to wrong entities. Use frontier models for long-document summarization or chunk-and-summarize with a small model plus a frontier merge step.
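The recommendation above can be sketched as a small routing rule. This is an illustration only: `pick_strategy`, the strategy labels, and the handling of the 2K-5K band (for which the report gives no numbers) are assumptions, and the token count is assumed to come from whatever tokenizer the pipeline already uses.

```python
SHORT_LIMIT = 2_000  # under this, the report finds small models close to frontier
LONG_LIMIT = 5_000   # over this, the report observes the factual-recall drop


def pick_strategy(token_count: int, needs_full_coverage: bool = True) -> str:
    """Return a hypothetical strategy label for one document."""
    if token_count < SHORT_LIMIT:
        return "small"
    if token_count > LONG_LIMIT and needs_full_coverage:
        return "chunk_small_merge_frontier"
    # Assumption: the report does not characterize this case, so the sketch
    # keeps the small model but flags the summary for a spot check.
    return "small_with_spot_check"
```

Swapping `"chunk_small_merge_frontier"` for a plain frontier call is the other option the report names; which one to prefer depends on budget.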
Journey Context:
The summarization quality cliff is insidious because summaries read fluently regardless of model: the errors are mostly omissions and quiet misattributions, not visibly broken text. You only discover the problem when someone checks whether a specific detail from page 12 made it into the summary. Smaller models handle the gist of a document well but lose precision on specifics. The hybrid approach \(chunk with a small model, merge with a frontier model\) costs roughly 30% of full-frontier processing while recovering about 90% of frontier quality: split the document into 2K-token sections, summarize each with Haiku \($0.25/M\), then synthesize the chunk summaries with Sonnet \($3/M\) into a final summary. This works because the expensive frontier model only processes the compressed representation, not the raw document. The failure signature to monitor: summaries that are vague where the source is specific, or that conflate distinct entities with similar names.
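A minimal sketch of the chunk-then-merge flow and its cost arithmetic, under stated assumptions. The `small` and `frontier` callables are hypothetical wrappers around whichever model client is in use. Whitespace-separated words stand in for tokens. The 0.2 compression ratio (chunk summaries are one fifth the length of the source) is an assumption, not a figure from this report, and only input-token pricing is counted.

```python
from typing import Callable, List

# Hypothetical type: any function that takes text and returns a summary.
Summarizer = Callable[[str], str]


def chunk_words(text: str, chunk_size: int = 2000) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words.

    Words are a crude proxy for tokens; swap in a real tokenizer if available.
    """
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def hybrid_summarize(text: str, small: Summarizer, frontier: Summarizer,
                     chunk_size: int = 2000) -> str:
    """Summarize each chunk with the small model, then merge with the frontier model."""
    partials = [small(chunk) for chunk in chunk_words(text, chunk_size)]
    # The frontier model only sees the compressed chunk summaries.
    return frontier("\n\n".join(partials))


def hybrid_cost_ratio(small_price: float = 0.25, frontier_price: float = 3.0,
                      compression: float = 0.2) -> float:
    """Input cost of the hybrid path relative to sending the raw document to frontier.

    Per source token: small_price for the chunk pass, plus
    compression * frontier_price for the merge pass.
    """
    return (small_price + compression * frontier_price) / frontier_price
```

With the default values the ratio is (0.25 + 0.2 × 3) / 3 ≈ 0.28, which is in line with the roughly 30% figure quoted above. A less aggressive compression ratio raises the cost accordingly.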
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T07:09:32.922362+00:00— report_created — created