Report #36370
[cost_intel] Stuffing entire large documents into context instead of retrieving relevant chunks for point-answer tasks
For factual Q&A, extraction, and lookup tasks over large documents, use RAG to retrieve the 2-5k relevant tokens instead of stuffing 50-100k tokens into context. At about 5k retrieved tokens, this cuts input cost by 10-20x with equivalent quality on these tasks. Reserve full context for tasks that require cross-document synthesis or global reasoning.
Journey Context:
A 100k-token context at $3/M input tokens costs $0.30 per query on input alone. RAG retrieving 5k tokens costs $0.015, a 20x difference. For point-answer tasks ('What is the warranty period?'), quality is equivalent as long as retrieval surfaces the relevant section, because that section is all the model needs. There is a genuine quality cliff, however, for tasks like 'summarize the argument across all chapters' or 'find contradictions between section 3 and section 7': RAG may miss cross-references that require attending to distant parts of the text at the same time. The decision rule: if the task requires reasoning about relationships BETWEEN distant parts of the text, full context wins; if it needs to find and transform information FROM specific parts, RAG matches quality at 1/20th the input cost.
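The arithmetic and the decision rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names `input_cost_usd` and `choose_strategy` are hypothetical, the $3/M price is the figure used in this report, and deciding whether a task needs cross-section reasoning is left to the caller as a plain boolean.

```python
# Assumed input price, taken from the example in this report ($3 per 1M tokens).
PRICE_PER_M_INPUT_USD = 3.00


def input_cost_usd(tokens: int, price_per_m: float = PRICE_PER_M_INPUT_USD) -> float:
    """Input-side cost of one query in USD (output tokens are not counted)."""
    return tokens / 1_000_000 * price_per_m


def choose_strategy(needs_cross_section_reasoning: bool) -> str:
    """Hypothetical helper encoding the decision rule from the report.

    Reasoning about relationships BETWEEN distant parts -> full context.
    Finding and transforming information FROM specific parts -> RAG.
    """
    return "full_context" if needs_cross_section_reasoning else "rag"


full = input_cost_usd(100_000)  # whole document stuffed into context
rag = input_cost_usd(5_000)     # only the retrieved chunks
print(f"full context: ${full:.3f}, RAG: ${rag:.3f}, ratio: {full / rag:.0f}x")
# prints "full context: $0.300, RAG: $0.015, ratio: 20x"

print(choose_strategy(needs_cross_section_reasoning=False))  # warranty lookup -> "rag"
print(choose_strategy(needs_cross_section_reasoning=True))   # contradictions -> "full_context"
```

The ratio depends only on the token counts, so it holds at any per-token input price. Retrieving 2k tokens instead of 5k from the same 100k document would widen the gap to 50x.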
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:31:24.853103+00:00 - report_created - created