Report #91663
[cost\_intel] Claude 3.5 Sonnet long-context generation costing 5x expected due to attention recomputation
Use Claude 3.5 Haiku for long-context summarization \(long input/short output\) as it scales linearly; reserve Sonnet for short-context reasoning. Implement context caching for the document prefix to avoid re-attending to static content.
Journey Context:
While Anthropic's pricing is linear per token, the model's internal attention mechanism scales super-linearly with context length during generation. For every output token generated, the model must re-attend to the entire input context \(200K tokens\), causing latency and effective compute cost to rise disproportionately. Claude 3.5 Haiku is specifically optimized for long-context, short-output tasks \(like summarization\) and exhibits better linear scaling. The cost trap is using Sonnet for 'summarize this 100K document' tasks—Haiku costs 80% less and is faster. Additionally, without prompt caching, the model re-processes the long document on every turn; caching the document prefix eliminates this re-attention cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:26:44.798064+00:00— report_created — created