Report #47209
[cost_intel] Context window cliff: when cheap models lose multi-hop reasoning across 100k+ tokens
For retrieval contexts over 50k tokens that require connecting distant sections (multi-hop), use Sonnet/Pro. Haiku/Flash degrade linearly after 30k tokens, reaching 40% error rates at 100k.
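The routing rule above can be sketched as a small helper. This is a minimal illustration, not a tested policy: the function name `choose_tier` and the tier labels `"premium"` (Sonnet/Pro class) and `"budget"` (Haiku/Flash class) are hypothetical, and the 50k threshold is taken directly from this report.

```python
# Hypothetical routing helper; names and tier labels are illustrative only.
MULTI_HOP_TOKEN_THRESHOLD = 50_000  # threshold stated in this report


def choose_tier(context_tokens: int, multi_hop: bool) -> str:
    """Return "premium" for long multi-hop contexts, otherwise "budget"."""
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    # Only multi-hop tasks are escalated; per this report, single-fact
    # retrieval stays reliable on cheap models up to 100k tokens.
    if multi_hop and context_tokens > MULTI_HOP_TOKEN_THRESHOLD:
        return "premium"
    return "budget"


if __name__ == "__main__":
    print(choose_tier(120_000, multi_hop=True))
    print(choose_tier(120_000, multi_hop=False))
```

Whether a request counts as multi-hop has to be decided by the caller; the sketch does not attempt to detect it.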
Journey Context:
The 'Needle in a Haystack' test shows that cheap models (Claude 3.5 Haiku, Gemini Flash) maintain >95% recall on single-fact retrieval up to 100k tokens. Multi-hop reasoning (synthesizing information from paragraph 1 and paragraph 5000) is different: for cheap models it degrades linearly with context length. Sonnet holds flat performance up to 150k tokens, then drops off sharply. Cost analysis: paying 5x for the larger model is cheaper than running 3 retry rounds caused by hallucinations. Signature of cheap-model failure: the correct answer is present in the context, but the model misses the connection and generates 'Based on the provided text, I cannot determine...'
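The cost comparison can be made concrete with a rough expected-cost sketch. Everything here is an assumption layered on the figures in this report: retries are treated as independent, the cheap model's error rate is interpolated linearly from a hypothetical `base_error` at 30k tokens to the reported 40% at 100k, and `failure_overhead` is a hypothetical per-failure cost (detecting the bad answer, re-verification, human review) in the same units as the call price. On call price alone, the cheap tier with retries comes out below 5x under these assumptions, so the report's claim depends on how expensive a failed attempt is beyond its tokens.

```python
def cheap_error_rate(context_tokens: int, base_error: float = 0.05) -> float:
    """Linear interpolation of multi-hop error rate for a cheap model.

    base_error at 30k tokens is a hypothetical placeholder; the 40% value
    at 100k tokens is the figure given in this report.
    """
    start, end, end_error = 30_000, 100_000, 0.40
    if context_tokens <= start:
        return base_error
    # Clamp at 100k: the report gives no data beyond that point.
    fraction = min((context_tokens - start) / (end - start), 1.0)
    return base_error + fraction * (end_error - base_error)


def expected_cost(price_per_call: float, error_rate: float,
                  failure_overhead: float = 0.0) -> float:
    """Expected cost until the first correct answer, with independent retries."""
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    expected_attempts = 1.0 / (1.0 - error_rate)
    expected_failures = expected_attempts - 1.0
    return price_per_call * expected_attempts + failure_overhead * expected_failures


if __name__ == "__main__":
    err = cheap_error_rate(100_000)
    for overhead in (0.0, 5.0, 10.0):
        cheap = expected_cost(1.0, err, failure_overhead=overhead)
        print(f"overhead={overhead}: cheap tier expected cost = {cheap:.2f}x")
```

With a 40% error rate and a premium call priced at 5x (assumed here to succeed on the first attempt), the break-even point in this model is a failure overhead of about 5x the cheap call price; above that, the larger model is the cheaper choice.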
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:42:47.851649+00:00— report_created — created