Report #76254
[frontier] Long documents retrieved for RAG consume too many tokens; how to compress context without losing salient details?
Use coarse-to-fine compression with a small model \(LLMLingua-2 or similar\): first filter out low-relevance chunks, then compress the remaining text by pruning redundant tokens rather than truncating, while preserving the structural markers \(JSON/XML tags\) that downstream parsing depends on.
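A minimal sketch of the coarse stage, assuming lexical overlap with the query as a stand-in for a small model's relevance score. The names `Chunk`, `coarse_filter`, and `token_budget` are hypothetical, and whitespace-separated words approximate tokens.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source_id: str  # citation key pointing back to the original document
    text: str


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric terms, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def coarse_filter(query: str, chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    """Keep the highest-overlap chunks that fit the budget, in original order.

    Assumption: term overlap stands in for a model-based relevance score.
    """
    query_terms = _terms(query)
    ranked = sorted(
        chunks, key=lambda c: len(query_terms & _terms(c.text)), reverse=True
    )
    kept, used = [], 0
    for chunk in ranked:
        cost = len(chunk.text.split())  # whitespace words approximate tokens
        if used + cost > token_budget:
            continue  # skip chunks that would exceed the budget
        kept.append(chunk)
        used += cost
    position = {id(c): i for i, c in enumerate(chunks)}
    kept.sort(key=lambda c: position[id(c)])  # restore retrieval order
    return kept
```

Only the chunks that survive this step would be handed to the token-level pass, which keeps the more expensive fine-grained scoring off text that is unlikely to matter.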
Journey Context:
Simple truncation loses key facts; summarization loses structure. The 2025 pattern is 'prompt compression' using information-theoretic approaches. The original LLMLingua \(Microsoft, 2023\) reported compression ratios of up to 20x with little loss of semantic integrity, using a smaller LLM to compute per-token perplexity and prune low-information tokens. LLMLingua-2 \(Microsoft, 2024\) replaces the perplexity heuristic with a small token classifier trained to decide which tokens to keep, which is faster but typically run at more moderate compression ratios. For agents, the critical addition is 'structural preservation': never compress inside JSON keys or XML tags that the agent relies on for tool arguments. The implementation runs a 'compression pass' between retrieval and the LLM call and maintains a mapping table from each compressed chunk back to its source, so that citations still point to the original documents despite the compressed form.
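The compression pass with structural preservation and a citation mapping table could be sketched roughly as follows. This is an illustration under stated assumptions, not the LLMLingua implementation: a fixed filler-word list (`FILLER`) stands in for model-based token scoring, and `PROTECTED`, `prune`, and `compress_with_citations` are hypothetical names.

```python
import re

# Spans copied through verbatim: XML/HTML tags and quoted JSON keys.
PROTECTED = re.compile(r'</?[A-Za-z][^<>]*>|"[^"\n]*"\s*:')

# Assumption: a tiny filler list replaces a model's low-information score.
FILLER = {"a", "an", "the", "of", "that", "is", "very", "really", "basically"}


def _prune_plain(segment: str) -> str:
    """Drop filler words (and the whitespace after them) from unprotected text."""
    out = [re.match(r"\s*", segment).group(0)]  # keep leading whitespace
    for word, space in re.findall(r"(\S+)(\s*)", segment):
        if word.lower() not in FILLER:
            out.append(word + space)
    return "".join(out)


def prune(text: str) -> str:
    """Compress text while leaving every protected span untouched."""
    out, pos = [], 0
    for match in PROTECTED.finditer(text):
        out.append(_prune_plain(text[pos:match.start()]))
        out.append(match.group(0))  # tag or JSON key, verbatim
        pos = match.end()
    out.append(_prune_plain(text[pos:]))
    return "".join(out)


def compress_with_citations(
    chunks: list[tuple[str, str]],
) -> tuple[str, dict[str, str]]:
    """Return the compressed context and a marker -> source_id mapping table."""
    lines, citation_map = [], {}
    for n, (source_id, text) in enumerate(chunks, start=1):
        marker = f"[{n}]"
        citation_map[marker] = source_id
        lines.append(f"{marker} {prune(text)}")
    return "\n".join(lines), citation_map
```

The model would see only the `[n]` markers; after generation, any marker it cites can be resolved through the returned table to the original source. Note that this sketch protects keys and tags only, so JSON string values are still compressed; values that carry exact tool arguments would need to be added to the protected pattern.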
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:34:53.536050+00:00— report_created — created