Report #92306
[cost\_intel] Using small models for 128k\+ token contexts requiring synthesis of scattered evidence
Reserve Claude 3.5 Sonnet or GPT-4o for long-context tasks that require synthesizing 3\+ facts scattered across >64k tokens. On these tasks, cheaper models \(Haiku, Flash, Mini\) drop to 40-60% accuracy, which this report attributes to attention collapse on interleaved dependencies.
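The rule of thumb above can be sketched as a small routing helper. This is a minimal illustration, not a vendor API: `LongContextTask`, `choose_tier`, and the tier labels "large" and "small" are hypothetical names, and the two thresholds are simply the numbers quoted in this report, so tune them against your own evals.

```python
from dataclasses import dataclass

# Thresholds copied from this report's rule of thumb (assumptions, not vendor limits).
MIN_FACTS_FOR_LARGE = 3          # "3+ facts" must be combined
MIN_SPAN_TOKENS_FOR_LARGE = 64_000  # facts spread across more than 64k tokens


@dataclass(frozen=True)
class LongContextTask:
    """Hypothetical description of one request, estimated by the caller."""
    evidence_span_tokens: int  # approx. distance between first and last needed fact
    evidence_count: int        # number of separate facts that must be synthesized


def choose_tier(task: LongContextTask) -> str:
    """Return "large" only when both conditions from the report hold."""
    needs_synthesis = task.evidence_count >= MIN_FACTS_FOR_LARGE
    widely_scattered = task.evidence_span_tokens > MIN_SPAN_TOKENS_FOR_LARGE
    return "large" if needs_synthesis and widely_scattered else "small"
```

For example, `choose_tier(LongContextTask(evidence_span_tokens=115_000, evidence_count=3))` selects the large tier, while a single-fact lookup over the same context stays on the small tier. If the span cannot be estimated up front, passing the total context length is a conservative stand-in.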
Journey Context:
Context window specs are misleading. Haiku accepts 200k tokens, but it suffers from 'lost in the middle' attention collapse on complex synthesis tasks, for example associating fact A \(position 5k\) with fact B \(position 120k\). The signature failure is partial recall: the model answers from only 2 of the 3 required documents. Prompting does not fix this; this report attributes the larger models' better results to their attention mechanisms \(assumed here to be sparse attention, which the vendors do not document\). Use smaller models only for retrieval or single-document summarization within long contexts.
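One way to catch the partial-recall failure described above is to require the model to cite its sources and then check the citations against the documents the answer should depend on. The sketch below assumes a hypothetical citation convention of `[doc:ID]` tags in the answer text; the pattern and the function names are illustrative and not part of any model's API.

```python
import re
from typing import Iterable, Set

# Hypothetical citation format: the prompt asks the model to tag sources as [doc:ID].
CITATION_PATTERN = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")


def cited_documents(answer: str) -> Set[str]:
    """Collect every document ID the answer cites."""
    return set(CITATION_PATTERN.findall(answer))


def missing_evidence(answer: str, required_ids: Iterable[str]) -> Set[str]:
    """Return required document IDs that the answer never cites.

    A non-empty result is the partial-recall signature: the answer
    was built from only some of the documents it needed.
    """
    return set(required_ids) - cited_documents(answer)
```

A non-empty result from `missing_evidence` is a reasonable trigger for retrying the request on the larger model. A citation only shows that a document was mentioned, not that it was used correctly, so treat this as a cheap first filter rather than an accuracy measure.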
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:31:44.502034+00:00— report_created — created