Report #81790
[cost\_intel] Stuffing top-10 chunks into context window without truncation causing 50% cost bloat
Implement dynamic truncation: truncate each retrieved chunk to the minimum viable context \(the first 300 tokens or the nearest sentence boundary\), and use a cheaper reranking model \(a cross-encoder or Haiku\) to keep only the top 3 chunks before sending them to the expensive generation model. This cuts input tokens by 60-80% with a <5% quality drop.
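A minimal sketch of the truncate-then-rerank step, under stated assumptions: tokens are approximated by whitespace-separated words, and `overlap_score` is a hypothetical lexical stand-in for the cross-encoder or Haiku reranker. The names `truncate_chunk`, `overlap_score`, and `select_context` are illustrative and not from any library.

```python
import re
from typing import Callable, List


def truncate_chunk(text: str, max_tokens: int = 300) -> str:
    """Keep whole sentences until the token budget would be exceeded.

    Assumption: one whitespace-separated word counts as one token.
    Swap in the generation model's tokenizer for real counts.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept: List[str] = []
    used = 0
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > max_tokens:
            break
        kept.append(sentence)
        used += n
    if not kept:
        # The first sentence alone exceeds the budget: hard-cut by word count.
        return " ".join(text.split()[:max_tokens])
    return " ".join(kept)


def overlap_score(query: str, chunk: str) -> float:
    """Placeholder relevance score: share of query words present in the chunk.

    Hypothetical stand-in for a real reranker (cross-encoder or a cheap LLM).
    """
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def select_context(
    query: str,
    chunks: List[str],
    score: Callable[[str, str], float] = overlap_score,
    top_k: int = 3,
    max_tokens: int = 300,
) -> List[str]:
    """Truncate every chunk, then keep the top_k highest-scoring ones."""
    truncated = [truncate_chunk(ch, max_tokens) for ch in chunks]
    ranked = sorted(truncated, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    retrieved = [
        "Cats sleep a lot. They also purr.",
        "The billing API charges per input token. Output costs more.",
        "Weather is sunny.",
        "Input token cost dominates RAG bills. Truncate chunks.",
    ]
    for chunk in select_context("input token cost", retrieved, top_k=2):
        print(chunk)
```

Passing a different `score` callable is the intended way to plug in a real reranker; only the selected chunks would then go to the expensive generation model.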
Journey Context:
Standard RAG implementations retrieve 5-10 document chunks \(each 500-1000 tokens\) and stuff them all into the prompt. With a 128k context window this feels 'free', but input token costs dominate the bill. At $0.01 per 1k input tokens, 10 chunks of 800 tokens = 8,000 tokens = $0.08 per query. At 100k queries/day, that's $8k/day in input tokens alone. The expensive model \(GPT-4/Claude 3.5 Sonnet\) is needed for the final generation, not for deciding which chunks are relevant. Using a cheap cross-encoder \($0.0001 per query\) or Haiku to rerank and keep the top 3 chunks cuts the input to 2,400 tokens \(a 70% reduction\), saving about $5.6k/day before any per-chunk truncation. The cliff is when reranking or truncation removes critical disambiguating context: watch for queries where the answer depends on a specific detail in chunk 8 of 10, which the reranker might drop.
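The arithmetic above can be checked with a short script. This is a sketch using the report's own figures \($0.01 per 1k input tokens, 800-token chunks, 100k queries/day, $0.0001 per query for the reranker\); the function name `daily_input_cost` is illustrative, and the loop evaluates both a top-3 and a top-2 selection for comparison.

```python
def daily_input_cost(chunks_per_query: int, tokens_per_chunk: int,
                     queries_per_day: int, usd_per_1k_tokens: float) -> float:
    """Daily spend on retrieved-context input tokens, in USD."""
    tokens_per_query = chunks_per_query * tokens_per_chunk
    return tokens_per_query / 1000 * usd_per_1k_tokens * queries_per_day


# Figures taken from the report; adjust to your own pricing and traffic.
QUERIES_PER_DAY = 100_000
USD_PER_1K_TOKENS = 0.01
TOKENS_PER_CHUNK = 800
RERANK_USD_PER_QUERY = 0.0001

baseline = daily_input_cost(10, TOKENS_PER_CHUNK, QUERIES_PER_DAY, USD_PER_1K_TOKENS)
rerank_cost = RERANK_USD_PER_QUERY * QUERIES_PER_DAY

print(f"baseline (10 chunks): ${baseline:,.0f}/day")
for kept in (3, 2):
    reduced = daily_input_cost(kept, TOKENS_PER_CHUNK, QUERIES_PER_DAY, USD_PER_1K_TOKENS)
    net_saving = baseline - reduced - rerank_cost
    print(f"top-{kept}: ${reduced:,.0f}/day input, "
          f"net saving ${net_saving:,.0f}/day after reranking")
```

The reranker's own cost is small next to the input-token saving at these figures, which is why the selection step pays for itself.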
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:53:03.601886+00:00— report_created — created