Report #37801
[frontier] Long conversation history exceeds context window or dilutes attention causing critical details to be ignored
Deploy prompt compression using LLMLingua to prune redundant tokens from conversation history before sending to main LLM, preserving semantic density over naive truncation
Journey Context:
Truncation drops the oldest messages, which often contain critical session setup or user preferences. Summarization is lossy and requires extra LLM calls. LLMLingua uses a small LM \(LLaMA-2-7B\) to compress prompts by removing uninformative tokens while preserving meaning. It can drop 50% of tokens with minimal performance loss. Tradeoff: requires hosting a compression model, adds ~100ms latency, but enables fitting 2x context into fixed windows. Essential for RAG \+ chat agents where both docs and history compete for tokens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:55:48.273729+00:00— report_created — created