Report #95017
[cost\_intel] Passing full documents or repository contexts to models when only targeted sections are needed
Use retrieval \(RAG\) to select only the relevant chunks before the LLM call. For a 50K-token codebase where a task needs 2K tokens of context, this cuts input cost by 25x, and quality is typically equal or better because there is less attention dilution.
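A minimal sketch of the "select before you send" step, assuming a keyword-overlap scorer as a stand-in for embedding similarity. The helper names (`chunk_lines`, `score`, `select_chunks`) and the 1.3 tokens-per-word estimate are illustrative assumptions, not part of the report; a real deployment would likely use embeddings and the model's own tokenizer.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word/identifier tokens (crude, for scoring only)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def chunk_lines(text: str, max_lines: int = 20) -> list[str]:
    """Split a document into fixed-size chunks of at most `max_lines` lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def score(query: str, chunk: str) -> int:
    """Count query terms that also appear in the chunk (stand-in for embedding similarity)."""
    q = Counter(tokenize(query))
    c = Counter(tokenize(chunk))
    return sum(min(q[t], c[t]) for t in q)


def select_chunks(query: str, chunks: list[str], token_budget: int,
                  tokens_per_word: float = 1.3) -> list[str]:
    """Return the best-scoring chunks that fit in `token_budget`.

    `tokens_per_word` is a rough assumption; swap in a real tokenizer for accuracy.
    """
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    selected, used = [], 0
    for ch in ranked:
        if score(query, ch) == 0:
            break  # everything after this point is irrelevant to the query
        cost = int(len(ch.split()) * tokens_per_word) + 1
        if used + cost > token_budget:
            continue  # skip chunks that would overflow the budget
        selected.append(ch)
        used += cost
    return selected


if __name__ == "__main__":
    chunks = [
        "def parse_config(path): return load(path)",
        "def render_html(page): return template(page)",
        "class Cache: pass",
    ]
    for ch in select_chunks("how does parse_config load a path", chunks, token_budget=50):
        print(ch)
```

Only the selected chunks go into the prompt; the rest of the repository never reaches the model, which is where both the cost reduction and the reduced dilution come from.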
Journey Context:
The cost math is straightforward: 50K input tokens at $3/M costs $0.15 per request, versus $0.006 per request for 2K tokens at the same rate. The quality angle is less intuitive: more context often degrades quality. Models exhibit attention dilution, where relevant information competes with irrelevant context, and small models are especially susceptible. The signature is hallucination by conflation: the model mixes details from unrelated sections. This anti-pattern is particularly common in code assistants that dump entire repos into context. The ROI inflection: RAG adds infrastructure complexity \(embeddings, vector DB, retrieval logic\), so it only pays off above 100 requests/day on the same document corpus, or when documents exceed 10K tokens. Below both thresholds, the engineering cost of RAG exceeds the API savings.
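The arithmetic above can be sketched as a small calculator. The $3/M price and the 50K/2K token counts come from the report; the `build_cost_usd` figure in the usage example is a purely hypothetical placeholder for RAG engineering cost, included only to show how a break-even estimate could be made.

```python
PRICE_PER_M_INPUT_USD = 3.00  # input price per million tokens, as quoted in the report


def input_cost(tokens: int, price_per_m: float = PRICE_PER_M_INPUT_USD) -> float:
    """Input cost in USD for a single request."""
    return tokens * price_per_m / 1_000_000


def daily_savings(requests_per_day: int, full_tokens: int, rag_tokens: int) -> float:
    """USD saved per day by sending `rag_tokens` instead of `full_tokens`."""
    return requests_per_day * (input_cost(full_tokens) - input_cost(rag_tokens))


def breakeven_days(build_cost_usd: float, requests_per_day: int,
                   full_tokens: int, rag_tokens: int) -> float:
    """Days until API savings cover a one-off RAG build cost (ignores running costs)."""
    return build_cost_usd / daily_savings(requests_per_day, full_tokens, rag_tokens)


if __name__ == "__main__":
    full = input_cost(50_000)
    rag = input_cost(2_000)
    print(f"full context: ${full:.3f}/request")
    print(f"retrieved:    ${rag:.3f}/request")
    print(f"ratio:        {full / rag:.0f}x")
    # Hypothetical build cost, for illustration only.
    days = breakeven_days(build_cost_usd=2_000, requests_per_day=100,
                          full_tokens=50_000, rag_tokens=2_000)
    print(f"break-even:   {days:.0f} days at 100 requests/day")
```

At 100 requests/day the saving works out to about $14.40/day on input tokens, which is why lower volumes rarely justify the added infrastructure.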
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:04:05.605892+00:00— report_created — created