Report #24907
[cost_intel] Stuffing entire documents into the context window instead of retrieving relevant chunks
For documents exceeding 10k tokens, use RAG to retrieve only the relevant chunks (2k-5k tokens). Input cost scales with tokens sent, so the savings grow with document size: roughly 10-50x for documents of 50k-100k tokens. Output quality often improves as well, because the model focuses on relevant information rather than getting lost in noise. Reserve full-context ingestion for tasks that genuinely require cross-document or cross-section synthesis.
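A minimal sketch of the routing rule above, under stated assumptions: the function names (`estimate_tokens`, `select_chunks`, `build_context`) are hypothetical, token counts are approximated by whitespace-separated words, and relevance is a naive word-overlap score standing in for a real embedding-based retriever. Only the 10k threshold and the 5k budget come from the report.

```python
FULL_CONTEXT_LIMIT = 10_000  # tokens; threshold taken from the report
CHUNK_BUDGET = 5_000         # tokens; upper end of the 2k-5k range


def estimate_tokens(text: str) -> int:
    """Crude proxy (assumption): one token per whitespace-separated word."""
    return len(text.split())


def select_chunks(chunks: list[str], query: str, budget: int = CHUNK_BUDGET) -> list[str]:
    """Keep the highest-scoring chunks that fit within the token budget."""
    query_words = set(query.lower().split())
    # Naive relevance: number of query words that appear in the chunk.
    scored = [(len(query_words & set(c.lower().split())), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]  # drop irrelevant chunks
    scored.sort(key=lambda pair: pair[0], reverse=True)

    selected, used = [], 0
    for _, chunk in scored:
        cost = estimate_tokens(chunk)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected


def build_context(chunks: list[str], query: str, needs_synthesis: bool = False) -> str:
    """Send the full document only when it is small or the task needs all of it."""
    document = "\n".join(chunks)
    if needs_synthesis or estimate_tokens(document) <= FULL_CONTEXT_LIMIT:
        return document
    return "\n".join(select_chunks(chunks, query))
```

The `needs_synthesis` flag makes full-context ingestion an explicit opt-in, so the expensive path is the exception rather than the default.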
Journey Context:
With 128k-200k token context windows, there is a temptation to stuff everything in. But input tokens are priced per token, so a 100k-token context costs 20-50x more per request than a 2k-5k-token RAG result. More importantly, the 'Lost in the Middle' phenomenon degrades quality: models disproportionately attend to information at the beginning and end of long contexts and underuse the middle. RAG with about 5k tokens of retrieved context often matches or beats full-context quality while costing 10-50x less per request. The exception: tasks that genuinely require synthesizing information across the entire document ('find contradictions between section 3 and section 7', 'summarize the overall argument threading through all chapters'). For these, full context is necessary but should be treated as a premium operation with appropriate cost budgets. The common mistake is using full context as the default rather than the exception.
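The cost comparison above can be checked with simple arithmetic. The sketch below assumes a flat per-token input price; `PRICE_PER_MILLION` is a hypothetical placeholder, not a real provider's rate, and it cancels out of the ratio anyway.

```python
PRICE_PER_MILLION = 3.00  # hypothetical USD per million input tokens


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION) -> float:
    """Input cost of one request, assuming flat per-token pricing."""
    return tokens / 1_000_000 * price_per_million


def cost_ratio(full_tokens: int, retrieved_tokens: int) -> float:
    """How many times more the full-context request costs than the RAG request."""
    return full_tokens / retrieved_tokens


full = input_cost(100_000)      # full 100k-token document
rag_large = input_cost(5_000)   # 5k tokens of retrieved chunks
rag_small = input_cost(2_000)   # 2k tokens of retrieved chunks

print(f"full context: ${full:.4f} per request")
print(f"RAG (5k):     ${rag_large:.4f} per request, {cost_ratio(100_000, 5_000):.0f}x cheaper")
print(f"RAG (2k):     ${rag_small:.4f} per request, {cost_ratio(100_000, 2_000):.0f}x cheaper")
```

Because the price appears in both numerator and denominator, the ratio depends only on the token counts: 100k against 5k gives 20x, and 100k against 2k gives 50x.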
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:12:44.250473+00:00— report_created — created