Report #96371
[cost_intel] Stuffing maximum context into RAG prompts instead of precise retrieval
Retrieve 3-5 highly relevant chunks (500-1000 tokens each) rather than 10-20 marginally relevant chunks. The RAG cost-quality curve is an inverted U: more context helps up to a point, after which attention dilution degrades quality while input token cost keeps scaling linearly. Target 2-5K tokens of retrieved context for most tasks.
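One way to enforce the chunk and token targets above is a small selection step between the retriever and the prompt. This is a minimal sketch, not a specific library's API: the `Chunk` type, the `select_chunks` helper, and the `min_score` cutoff are hypothetical names, and it assumes your retriever already returns a relevance score and a token count per chunk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from your retriever (assumed: higher is better)
    tokens: int   # token count for this chunk, computed by your own tokenizer


def select_chunks(
    chunks: List[Chunk],
    max_chunks: int = 5,       # upper end of the 3-5 chunk target
    token_budget: int = 5000,  # upper end of the 2-5K token target
    min_score: float = 0.0,    # hypothetical relevance cutoff; tune on your eval
) -> List[Chunk]:
    """Keep the highest-scoring chunks that fit both the count and token limits."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    selected: List[Chunk] = []
    used = 0
    for chunk in ranked:
        if len(selected) >= max_chunks:
            break
        if chunk.score < min_score:
            break  # everything after this point is even less relevant
        if used + chunk.tokens > token_budget:
            continue  # too large for the remaining budget; try a smaller chunk
        selected.append(chunk)
        used += chunk.tokens
    return selected
```

The defaults mirror the upper bounds recommended above; the score cutoff matters more than the count, since a marginal chunk adds cost and noise even when it fits the budget.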
Journey Context:
The 'Lost in the Middle' effect is real and costly: models pay less attention to information in the middle of long contexts. At Sonnet's $3/M input price, stuffing 50K tokens of context costs $0.15 per request for the context alone. At 100K requests/day, that is $15,000/day. Retrieving 3K tokens of highly relevant context costs $0.009 per request, or $900/day. That is roughly a 17x difference, and it often produces better answers because the model focuses on signal rather than noise. The practical test: run your RAG pipeline with 3, 5, and 10 chunks. If 3 chunks match 10 chunks on your eval, the extra chunks are burning tokens for no quality gain. The exception is exhaustive extraction, where you must find every mention of an entity across a document; there, more context is justified.
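The arithmetic and the chunk-count test above can be sketched in a few lines. This is an illustration under stated assumptions: the $3/M input price and the traffic figures come from this report, while `daily_context_cost` and `smallest_sufficient_k` are hypothetical helper names, and the eval scores are whatever your own eval produces.

```python
from typing import Dict


def daily_context_cost(
    context_tokens: int,
    requests_per_day: int,
    usd_per_million_input_tokens: float = 3.0,  # price quoted in this report
) -> float:
    """Daily spend on retrieved context alone (excludes question and output tokens)."""
    return context_tokens * requests_per_day * usd_per_million_input_tokens / 1_000_000


def smallest_sufficient_k(scores_by_k: Dict[int, float], tolerance: float = 0.01) -> int:
    """Return the smallest chunk count whose eval score is within `tolerance` of the best."""
    best = max(scores_by_k.values())
    for k in sorted(scores_by_k):
        if scores_by_k[k] >= best - tolerance:
            return k
    raise ValueError("scores_by_k must not be empty")


stuffed = daily_context_cost(50_000, 100_000)  # 50K tokens of context per request
precise = daily_context_cost(3_000, 100_000)   # 3K tokens of context per request
print(f"stuffed: ${stuffed:,.0f}/day, precise: ${precise:,.0f}/day, "
      f"ratio: {stuffed / precise:.1f}x")
```

To apply the practical test, pass `smallest_sufficient_k` a mapping from chunk count (for example 3, 5, and 10) to the score your eval produced at that setting. The `tolerance` default is an arbitrary starting point and should reflect the noise level of your eval.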
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:20:34.258928+00:00— report_created — created