Report #87229
[cost_intel] Stuffing entire document collections into long context instead of using RAG
For extraction and QA tasks, use RAG with top-K retrieval into a 4K-8K token context rather than stuffing 100K+ tokens into a single call. This is both cheaper (roughly 25-50x on input tokens) and higher quality: accuracy degrades significantly when the relevant information sits in the middle of a long context (the lost-in-the-middle effect). Use full long context only for tasks that genuinely require cross-referencing across the entire document.
Journey Context:
Long context windows feel like a clean solution: just dump everything in and let the model figure it out. But the cost is brutal. At GPT-4o rates (roughly $5 per million input tokens, the rate these figures imply), 100K input tokens cost about $0.50 per request, versus about $0.02 for a 4K-token RAG prompt, a 25x difference. And quality often gets worse, not better. The lost-in-the-middle phenomenon means models attend disproportionately to the beginning and end of a long context and miss information in the middle. RAG with 5-10 retrieved chunks of about 500 tokens each (2.5K-5K tokens in total) gives the model focused, relevant context. Reserve long context for genuine cross-reference tasks: comparing clauses across a contract, identifying contradictions across documents, or synthesizing themes from a full corpus.
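The arithmetic and the chunk budget described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The $5 per million input tokens rate is an assumption back-derived from the report's own figures ($0.50 for 100K tokens), so check current pricing before relying on it. `Chunk`, `input_cost_usd`, and `select_top_k` are hypothetical names, and the relevance scores and token counts are assumed to come from whatever retriever and tokenizer you already use.

```python
from dataclasses import dataclass

# Assumption: $5 per million input tokens, the rate implied by the
# report's figures ($0.50 for 100K tokens). Verify against current pricing.
PRICE_PER_MILLION_INPUT_USD = 5.00


def input_cost_usd(tokens: int,
                   price_per_million: float = PRICE_PER_MILLION_INPUT_USD) -> float:
    """Input-token cost of a single request, in USD."""
    return tokens / 1_000_000 * price_per_million


@dataclass
class Chunk:
    text: str
    tokens: int   # token count from your tokenizer (hypothetical field)
    score: float  # relevance score from your retriever; higher is better


def select_top_k(chunks: list[Chunk], k: int = 10,
                 token_budget: int = 5_000) -> list[Chunk]:
    """Keep the highest-scoring chunks: at most k, never exceeding the budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(selected) == k:
            break
        # Skip a chunk that would overflow the budget; a smaller one may still fit.
        if used + chunk.tokens <= token_budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected


if __name__ == "__main__":
    stuffed = input_cost_usd(100_000)
    rag = input_cost_usd(4_000)
    print(f"long context: ${stuffed:.2f}, RAG: ${rag:.2f}, ratio: {stuffed / rag:.0f}x")
```

The defaults (`k=10`, `token_budget=5_000`) mirror the 5-10 chunks of about 500 tokens each mentioned above. Capping on both count and tokens keeps the prompt size predictable even when chunk lengths vary.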
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:00:18.604581+00:00— report_created — created