Report #70953
[frontier] RAG retrieving too many chunks and blowing token budget or diluting signal with redundant information
Implement token-budget-aware retrieval: before the LLM call, calculate the tokens available for retrieved context (context_window - max_output - system_prompt, minus any other fixed prompt content such as few-shot examples). Pass this budget to a 'compressive retriever' (LLMLingua-2 or similar) that selects and compresses documents to fit the budget while maximizing information gain, instead of returning a fixed top-k by similarity.
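A minimal sketch of the budget calculation and a budget-aware selection step, under stated assumptions. It does not call LLMLingua-2 or any real compression library: `count_tokens` is a crude whitespace stand-in for the model's tokenizer, and `select_within_budget` is a hypothetical greedy selector that approximates "information gain" as similarity score minus a word-overlap penalty. All names here are illustrative, not part of the report.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for the model's real tokenizer.
    return len(text.split())


def available_budget(context_window: int, max_output: int, system_prompt: str) -> int:
    # context_window - max_output - system_prompt, floored at zero.
    return max(0, context_window - max_output - count_tokens(system_prompt))


@dataclass
class Chunk:
    text: str
    score: float  # query similarity, assumed to come from the upstream retriever


def marginal_gain(chunk: Chunk, seen_words: set, redundancy_weight: float) -> float:
    # Hypothetical proxy for information gain: penalize words already selected.
    words = set(chunk.text.lower().split())
    overlap = len(words & seen_words) / len(words) if words else 1.0
    return chunk.score - redundancy_weight * overlap


def select_within_budget(chunks, budget: int, redundancy_weight: float = 0.5):
    """Greedily pick chunks by marginal gain until the token budget is spent."""
    selected = []
    seen_words = set()
    remaining = budget
    candidates = list(chunks)
    while True:
        # Drop anything that no longer fits in the remaining budget.
        candidates = [c for c in candidates if count_tokens(c.text) <= remaining]
        if not candidates:
            break
        best = max(candidates, key=lambda c: marginal_gain(c, seen_words, redundancy_weight))
        if marginal_gain(best, seen_words, redundancy_weight) <= 0:
            break  # nothing left adds new information
        selected.append(best)
        seen_words |= set(best.text.lower().split())
        remaining -= count_tokens(best.text)
        candidates.remove(best)
    return selected


chunks = [
    Chunk("paris is the capital of france", 0.9),
    Chunk("paris is the capital of france", 0.85),  # near-duplicate
    Chunk("france uses the euro currency", 0.6),
]
picked = select_within_budget(chunks, budget=12)
```

With a plain top-2 by similarity, both copies of the first sentence would be returned; here the duplicate is penalized and the lower-scored but novel chunk takes its place. A real pipeline would swap the selector for an actual compressor and the token counter for the model's tokenizer.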
Journey Context:
Standard RAG is 'retrieve 5 chunks, hope it fits'. In production, chunk sizes vary, system prompts grow, and dynamic few-shot examples eat into the budget, which leads to context overflow errors or silent truncation. Token-budget scheduling treats context as a constrained resource, like CPU or memory. The fix draws on 'prompt compression' research (LLMLingua, Selective Context) but applies it inside the retrieval pipeline rather than as a post-retrieval step. This enables 'long-context RAG' without long-context models. Tradeoff: the compression step adds latency, but costs less than moving to a model with a larger context window.
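To illustrate treating context as a constrained resource, the sketch below reserves tokens for each fixed prompt component and fails loudly instead of truncating silently. The policy shown (drop trailing few-shot examples until a minimum retrieval budget is met) is one hypothetical choice, not something the report prescribes; `schedule_context`, `ContextBudgetError`, and `min_retrieval` are illustrative names, and the token counts are assumed to be measured elsewhere with the model's tokenizer.

```python
class ContextBudgetError(ValueError):
    """Raised when fixed prompt parts leave too little room for retrieval."""


def schedule_context(
    context_window: int,
    max_output: int,
    system_tokens: int,
    few_shot_tokens: list,
    min_retrieval: int,
):
    """Return (few_shot_examples_kept, retrieval_budget).

    Drops few-shot examples from the end of the list until at least
    `min_retrieval` tokens remain for retrieved context.
    """
    fixed = max_output + system_tokens
    kept = len(few_shot_tokens)
    while True:
        retrieval_budget = context_window - fixed - sum(few_shot_tokens[:kept])
        if retrieval_budget >= min_retrieval:
            return kept, retrieval_budget
        if kept == 0:
            # Surface the overflow explicitly rather than truncating silently.
            raise ContextBudgetError(
                f"only {retrieval_budget} tokens left for retrieval, "
                f"need at least {min_retrieval}"
            )
        kept -= 1


kept, budget = schedule_context(
    context_window=8192,
    max_output=1024,
    system_tokens=500,
    few_shot_tokens=[300, 300, 300],
    min_retrieval=6000,
)
```

In this example the three few-shot examples would leave 5768 tokens, below the 6000-token floor, so one example is dropped and the retriever receives a budget of 6068 tokens.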
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:40:29.494694+00:00— report_created — created