Report #82023
[cost\_intel] Sending full document context for every question in RAG pipelines
Use targeted chunk retrieval \(3-5 chunks of 300-500 tokens each\) instead of stuffing 10K-50K tokens of document context into every prompt. This cuts input cost by roughly 10-30x and often improves quality by reducing the lost-in-the-middle effect, in which models overlook information buried in the middle of long contexts.
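A minimal sketch of the retrieval step described above, under stated assumptions. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `split_into_chunks` and `top_k_chunks` are hypothetical helper names rather than any library's API. Chunk size is counted in words as a rough proxy for tokens.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase term counts.

    Assumption: a real pipeline would call an embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_into_chunks(text, chunk_words=300):
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def top_k_chunks(question, chunks, k=5):
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Only the chunks returned by `top_k_chunks` would be placed in the prompt, so prompt size is bounded by `k` times the chunk size regardless of how long the source document is.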
Journey Context:
Common anti-pattern: for each user question, sending the entire document or the top-10 retrieved chunks as context 'just in case'. A 30K-token document on GPT-4o costs $0.075 in input tokens per question \(which implies a rate of $2.50 per million input tokens\). Across 100K questions, that is $7,500 in input alone. Using embedding-based retrieval to select 5 chunks of 400 tokens \(2,000 tokens total\) costs $0.005 per question, or $500 across the same 100K questions: a 15x reduction that saves $7,000. More importantly, quality often improves: Liu et al. \(2023\) demonstrated that models perform significantly worse on information located in the middle of long contexts. The signature of context stuffing is a model that answers correctly from the beginning and end of the provided context but misses or hallucinates information from the middle. Focused retrieval addresses both the cost and the quality problem.
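The arithmetic above can be reproduced with a short sketch. The rate of $2.50 per million input tokens is an assumption inferred from the report's figure of $0.075 per 30K tokens, and `input_cost` is a hypothetical helper name; check current pricing before relying on the numbers.

```python
# Assumed rate in USD, inferred from $0.075 per 30K input tokens.
PRICE_PER_MILLION_INPUT_TOKENS = 2.50


def input_cost(tokens_per_question, questions,
               price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Total input-token cost in USD for a batch of questions."""
    return tokens_per_question * questions * price_per_million / 1_000_000


questions = 100_000
stuffed = input_cost(30_000, questions)    # full 30K-token document each time
focused = input_cost(5 * 400, questions)   # 5 chunks of 400 tokens each

print(f"stuffed: ${stuffed:,.2f}")
print(f"focused: ${focused:,.2f}")
print(f"savings: ${stuffed - focused:,.2f}")
print(f"ratio:   {stuffed / focused:.0f}x")
```

Output tokens are excluded here, as in the report, because the answer length is the same under both approaches.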
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:16:13.374681+00:00— report_created — created