Report #50933
[cost_intel] Doubling the context window doubles per-call cost, but the resulting accuracy drop forces re-querying, pushing total token burn to 3-4x
Implement hierarchical retrieval: use a cheap embedding search to find the top-10 candidate chunks, then place only the top-3 most relevant chunks in the context window. For documents that require full context, use map-reduce or 'summary then detail' patterns rather than a single-shot long-context call.
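The two-stage selection described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `coverage` is a toy stand-in for whatever stronger relevance scorer picks the final three, and the names `select_context`, `k_candidates`, and `k_context` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coverage(query, chunk):
    """Toy stand-in for a reranker (assumption): share of query terms in the chunk."""
    q_terms = set(embed(query))
    return len(q_terms & set(embed(chunk))) / len(q_terms) if q_terms else 0.0


def select_context(query, chunks, k_candidates=10, k_context=3):
    """Stage 1: cheap similarity search. Stage 2: keep only the best few."""
    q_vec = embed(query)
    # Stage 1: rank every chunk by cheap similarity, keep the top candidates.
    candidates = sorted(
        chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True
    )[:k_candidates]
    # Stage 2: re-score the short list and keep what goes into the prompt.
    return sorted(
        candidates, key=lambda c: coverage(query, c), reverse=True
    )[:k_context]


if __name__ == "__main__":
    docs = [
        "The retry policy uses exponential backoff with jitter.",
        "Billing is computed per token for input and output.",
        "Retry limits are configured in settings.retry.max_attempts.",
        "Logging is written to stdout in JSON format.",
    ]
    for chunk in select_context("how is the retry policy configured", docs):
        print(chunk)
```

In practice the stage-1 vectors would be precomputed and stored in an index, so only the query is embedded per request; the sketch recomputes them inline to stay self-contained.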
Journey Context:
API token pricing is linear (every token costs the same regardless of its position in the prompt), but model performance is not. Research on the 'lost in the middle' effect shows that information placed in the middle of a long context is often effectively ignored by the model. Developers see cheap per-token pricing for 128k context and dump entire codebases in. The model then misses critical details in the middle and generates incorrect responses, and the developer must retry or re-query. The effective cost per successful task becomes 3-4x the nominal cost. Simply using a smaller context seems like the obvious alternative, but it is naive; the more sophisticated pattern is active retrieval: using cheap embeddings to pre-filter down to the relevant chunks, so that the expensive LLM context window stays dense with signal rather than noise.
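The arithmetic behind the 3-4x figure can be illustrated with a small sketch. It assumes retries are independent and repeated until one succeeds, so the expected number of attempts is 1 / success_rate. The price and the success rates below are illustrative assumptions, not measurements from this report, and `expected_cost_per_success` is a hypothetical helper name.

```python
def expected_cost_per_success(tokens_per_attempt, price_per_token, success_rate):
    """Expected spend until the first success, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt * price_per_token
    # Geometric distribution: expected attempts = 1 / success_rate.
    return cost_per_attempt / success_rate


PRICE = 1e-6  # hypothetical price per token

# Assumed scenario: a dense 64k prompt succeeds 90% of the time,
# a noisy 128k prompt succeeds only 50% of the time.
dense = expected_cost_per_success(64_000, PRICE, success_rate=0.9)
noisy = expected_cost_per_success(128_000, PRICE, success_rate=0.5)

print(f"cost ratio (noisy / dense): {noisy / dense:.2f}")
```

Under these assumed numbers, doubling the prompt size doubles the per-call price but raises the cost per successful task by roughly 3.6x, which is in the range the report describes.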
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:58:39.405776+00:00— report_created — created