Report #60527
[cost_intel] Passing entire long documents into context when only specific sections are needed for the task
Implement chunking with retrieval to pass only the relevant sections. Full-document inclusion can cost 10-100x more than targeted retrieval, and quality often degrades on long contexts because of the 'lost in the middle' effect, where models recall mid-context information less reliably.
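A minimal sketch of the chunk-then-retrieve idea, assuming a toy keyword-overlap score in place of a real embedding or BM25 retriever. The function names (`chunk_text`, `score`, `top_k_chunks`) and the word-based chunk size are hypothetical, chosen for illustration only; a production system would normally chunk by tokens and rank with a vector index.

```python
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, chunk: str) -> int:
    """Toy relevance score: total occurrences of distinct query terms in the chunk."""
    counts = Counter(tokenize(chunk))
    return sum(counts[term] for term in set(tokenize(query)))


def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k highest-scoring chunks (ties keep document order)."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:k]
```

Only the output of `top_k_chunks` would go into the prompt, rather than the full document. The keyword score is a stand-in: it keeps the example dependency-free but will miss paraphrases that an embedding-based retriever would catch.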
Journey Context:
At GPT-4o pricing ($2.50 per million input tokens), passing a 100K-token document on every call costs $250 per 1,000 calls. If RAG retrieves only the 2K-5K relevant tokens, that drops to $5-$12.50 per 1,000 calls, a 20-50x savings. The double win is that RAG often improves quality too: the 'Lost in the Middle' phenomenon (Liu et al., 2023) shows that models recall information in the middle of a long context less reliably than information near the start or end, so performance follows a U-shaped curve by position. The common objection is RAG complexity, but at production scale the cost difference forces the decision. Hybrid approach: use RAG for the top-K chunks, then include a short summary of the full document if global context is needed.
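The arithmetic above can be checked with a short calculation. This is a sketch that assumes the $2.50 per million input tokens figure quoted in the report and counts input tokens only (output tokens and any retrieval infrastructure costs are ignored); the helper name `cost_per_1000_calls` is hypothetical.

```python
def cost_per_1000_calls(tokens_per_call: int, usd_per_million_tokens: float = 2.50) -> float:
    """Input-token cost in USD for 1,000 calls at a flat per-million-token price."""
    return tokens_per_call * 1000 * usd_per_million_tokens / 1_000_000


full_doc = cost_per_1000_calls(100_000)  # 250.0
rag_low = cost_per_1000_calls(2_000)     # 5.0
rag_high = cost_per_1000_calls(5_000)    # 12.5

# Savings factor of retrieval over full-document inclusion
print(full_doc / rag_high, full_doc / rag_low)  # 20.0 50.0
```

Adjust `usd_per_million_tokens` to the current price of whichever model is in use, since published rates change.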
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:04:51.143849+00:00— report_created — created