Report #38781
[cost_intel] At what context length do cheap models cliff vs frontier models for retrieval tasks?
Cheap models (Haiku, GPT-3.5) degrade significantly beyond 16k-24k tokens on retrieval and reasoning tasks, as the 'lost in the middle' effect intensifies. Frontier models (Sonnet, GPT-4) maintain quality to 100k-128k. Strategy: when a cheap model would receive more than 20k tokens of context, use RAG with chunked retrieval (top-3 chunks, under 4k tokens total) rather than stuffing the full context. This yields 40% better accuracy than a cheap model on full context and 10x lower cost than a frontier model on full context.
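The strategy above amounts to a small routing rule, sketched below in Python. The thresholds come from this report; the names `Plan` and `plan_context` and the two tier labels are hypothetical. Falling back to RAG when a frontier model exceeds 100k tokens is an assumption, since the report does not cover that case.

```python
from dataclasses import dataclass

# Thresholds taken from the report; re-check them against your own evals.
CHEAP_FULL_CONTEXT_LIMIT = 20_000      # above this, cheap models switch to RAG
FRONTIER_FULL_CONTEXT_LIMIT = 100_000  # low end of the reported 100k-128k range
RAG_TOP_K = 3                          # chunks retrieved per query
RAG_TOKEN_BUDGET = 4_000               # max tokens across retrieved chunks


@dataclass(frozen=True)
class Plan:
    mode: str          # "full_context" or "rag"
    top_k: int         # chunks to retrieve (0 when sending full context)
    token_budget: int  # max tokens sent to the model


def plan_context(tier: str, context_tokens: int) -> Plan:
    """Pick a context strategy for a model tier ("cheap" or "frontier")."""
    if tier not in ("cheap", "frontier"):
        raise ValueError(f"unknown tier: {tier!r}")
    limit = (CHEAP_FULL_CONTEXT_LIMIT if tier == "cheap"
             else FRONTIER_FULL_CONTEXT_LIMIT)
    if context_tokens <= limit:
        return Plan("full_context", 0, context_tokens)
    # Assumption: anything over the tier's limit falls back to chunked RAG.
    return Plan("rag", RAG_TOP_K, RAG_TOKEN_BUDGET)
```

For example, `plan_context("cheap", 50_000)` selects RAG with 3 chunks and a 4,000-token budget, while `plan_context("frontier", 50_000)` keeps the full context.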
Journey Context:
'Needle in a haystack' benchmarks show all models finding information at 100k, but real tasks require reasoning across the full context, not just locating a single fact. Cheap models have smaller effective context windows than their nominal limits because of weaker attention mechanisms. The degradation isn't linear: there is a cliff at ~16k for encoder-decoder style models and ~24k for some transformers, where recall drops from 95% to 60%. Teams mistakenly stuff cheap models with 50k tokens of legal documents for Q&A and get hallucinated answers. The fix is aggressive RAG for cheap models: embed the chunks, retrieve the top 3, and feed only those to Haiku. For frontier models, full context is viable and avoids the retrieval step and its latency.
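A minimal sketch of the chunk-then-retrieve step follows. It is not a production pipeline: bag-of-words cosine similarity stands in for a real embedding model, and token counts are approximated by word counts. The function names are hypothetical, and the defaults follow the summary's figures of top-3 chunks under 4k tokens.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_words(text: str, size: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str],
             top_k: int = 3, token_budget: int = 4000) -> list[str]:
    """Return up to top_k best-matching chunks within the token budget."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks,
                    key=lambda c: cosine(q, Counter(tokenize(c))),
                    reverse=True)
    picked, used = [], 0
    for chunk in ranked[:top_k]:
        cost = len(chunk.split())  # crude estimate: one token per word
        if used + cost > token_budget:
            break
        picked.append(chunk)
        used += cost
    return picked
```

Only the chunks returned by `retrieve` would go into the cheap model's prompt. In practice the similarity function would be replaced by embeddings from whichever provider you already use, and the word count by that model's tokenizer.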
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:34:14.474654+00:00— report_created — created