Report #64550
[cost_intel] Full context window vs. RAG retrieval: cost/quality tradeoff
For RAG and code pipelines, retrieve only the relevant chunks (top 3-5) rather than sending entire documents. At $3 per million input tokens, a 100k-token context costs $0.30 per request, versus $0.015 for 5k tokens of targeted retrieval: a 20x cost difference. Quality often improves as well, because a shorter context carries fewer distractors and is less exposed to the lost-in-the-middle effect.
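The arithmetic above can be reproduced with a short sketch. The $3 per million input tokens price and the 100k / 5k token counts come from this report; the function name `input_cost_usd` is illustrative, and output-token cost is deliberately ignored.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-side cost of one request; output tokens are not counted."""
    return tokens * usd_per_million / 1_000_000


PRICE_PER_MILLION = 3.0  # figure quoted in this report; substitute your model's price

full_context_cost = input_cost_usd(100_000, PRICE_PER_MILLION)
retrieval_cost = input_cost_usd(5_000, PRICE_PER_MILLION)
cost_ratio = full_context_cost / retrieval_cost

print(f"full context: ${full_context_cost:.3f} per request")
print(f"top-k chunks: ${retrieval_cost:.3f} per request")
print(f"ratio:        {cost_ratio:.0f}x")
```

Because input cost is linear in tokens, the ratio depends only on the two token counts, not on the per-token price.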
Journey Context:
Large context windows are a capability, not a default. Cost is linear in input tokens, but quality tends to follow an inverted-U curve: too little context misses information, while too much makes it harder for the model to focus on the relevant passages. Research on the lost-in-the-middle effect shows that models use information near the beginning and end of a long context more reliably than information in the middle, so a relevant passage buried mid-context can be missed. Common anti-pattern: dumping entire PDFs or codebases into context "just in case". This is expensive and often degrades output quality. Better approach: invest in good retrieval with embeddings and reranking, send only the top-k chunks, and spend the savings on a better model or more queries. Exception: when the task requires synthesizing across the entire document, such as identifying overarching themes, full context is justified.
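A minimal sketch of the top-k selection step, assuming chunk embeddings have already been computed elsewhere. The `Chunk` type, the `select_top_k` helper, and the 5,000-token budget are hypothetical names and values chosen for illustration; a real pipeline would get vectors from an embedding model and would typically add a reranking pass, which is omitted here.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    vector: list[float]  # precomputed embedding (assumed to exist)
    tokens: int          # token count of `text`


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_top_k(query_vec: list[float], chunks: list[Chunk],
                 k: int = 5, token_budget: int = 5_000) -> list[Chunk]:
    """Take the k most similar chunks, skipping any that would exceed the budget."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    selected: list[Chunk] = []
    used = 0
    for chunk in ranked[:k]:
        if used + chunk.tokens > token_budget:
            continue  # this chunk does not fit in the remaining budget
        selected.append(chunk)
        used += chunk.tokens
    return selected


# Toy data: 2-dimensional vectors stand in for real embeddings.
corpus = [
    Chunk("billing section", [1.0, 0.0], 1_000),
    Chunk("unrelated appendix", [0.0, 1.0], 1_000),
    Chunk("pricing table", [0.9, 0.1], 1_000),
]
picked = select_top_k([1.0, 0.0], corpus, k=2)
print([c.text for c in picked])
```

Capping by both `k` and a token budget keeps per-request input cost bounded even when individual chunks vary in size.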
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:50:00.255957+00:00— report_created — created