Report #62131
[cost\_intel] Dumping entire documents into LLM context for retrieval-augmented generation instead of targeted chunk retrieval
Retrieve only the top 3-5 most relevant chunks \(2-5K tokens total\) rather than entire documents. For most QA tasks, this reduces cost by 10-30x with <5% quality loss. Only expand context when the task explicitly requires cross-document synthesis or answers that span widely separated sections of a document.
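As a rough illustration of the recommendation above, the sketch below selects up to `top_k` chunks under a token budget. Everything in it is an assumption made for illustration: `select_chunks`, `tokenize`, and `approx_tokens` are hypothetical helpers, word overlap stands in for whatever embedding or reranking model a real pipeline would use, and the four-characters-per-token estimate is only a crude heuristic.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def approx_tokens(text: str) -> int:
    """Rough token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def select_chunks(query: str, chunks: list[str],
                  top_k: int = 5, token_budget: int = 5000) -> list[str]:
    """Return up to top_k chunks, best-scoring first, within token_budget."""
    query_terms = tokenize(query)
    # Rank chunks by how many query terms they share (hypothetical scoring).
    ranked = sorted(chunks,
                    key=lambda c: len(query_terms & tokenize(c)),
                    reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if len(selected) == top_k:
            break
        if not query_terms & tokenize(chunk):
            break  # every remaining chunk shares no terms with the query
        cost = approx_tokens(chunk)
        if used + cost > token_budget:
            continue  # skip a chunk that would exceed the budget
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our offices are closed on public holidays.",
    "Return requests must include the original order number.",
]
context = select_chunks("How long do refunds take after a return?", chunks, top_k=2)
```

Returning an empty list when nothing overlaps is deliberate in this sketch: it lets the caller surface an 'I don't know' response rather than padding the prompt with irrelevant text.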
Journey Context:
With 200K-token context windows, it is tempting to dump everything in. But at Sonnet pricing \($3 per million input tokens, the rate these figures imply\), a 100K-token input costs $0.30 per call versus $0.006 for a 2K-token input, a 50x difference. Quality does not make up for the cost: LLMs exhibit 'lost in the middle' effects, where information in the middle of a long context is poorly utilized \(accuracy drops 10-20% for middle-positioned facts compared with facts at the beginning or end\). Aggressive retrieval with 3-5 chunks typically matches or exceeds full-document quality for factual QA because the model attends to the relevant passages instead of having them diluted by noise. Each failure mode has a recognizable signature. Over-retrieving causes hedging \('it depends on which section...'\), self-contradiction from conflicting passages, or fixation on irrelevant but prominent information. Under-retrieving causes 'I don't know' responses, which are preferable to confident hallucination from an overloaded context.
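The cost comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the $3 per million input tokens rate is an assumption inferred from the $0.30 per 100K-token figure, `input_cost` is a hypothetical helper, and output tokens are ignored.

```python
# Assumed price: $3 per million input tokens, the rate implied by the
# $0.30 / 100K-token figure. Check current pricing before relying on it.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(tokens: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Return the input-side cost in dollars for a single call."""
    return tokens / 1_000_000 * price_per_million


full_document = input_cost(100_000)  # whole document in context
top_chunks = input_cost(2_000)       # a few retrieved chunks
ratio = full_document / top_chunks

print(f"full document: ${full_document:.3f} per call")
print(f"top chunks:    ${top_chunks:.3f} per call")
print(f"ratio:         {ratio:.0f}x")
```

The per-call gap looks small in absolute terms, but the same ratio applies to the total input bill at any call volume.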
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:46:18.602829+00:00— report_created — created