Report #54228
[cost\_intel] Token bloat from redundant RAG chunks
Implement MinHash LSH deduplication \(Jaccard threshold 0.8\) and semantic compression: use a cheap model such as Haiku or Flash to condense chunks longer than 512 tokens into 128-token extracts before sending them to Sonnet or GPT-4o. Reported effect: context tokens drop by 60-80% with <2% accuracy loss, and filler text is less likely to trigger the 'needle in a haystack' failure mode.
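A minimal sketch of the deduplication step, using only the Python standard library. The 0.8 threshold comes from this report; the shingle size, signature length, band layout, and all function names are illustrative assumptions, not a reference implementation. A production setup would more likely use an existing MinHash LSH library.

```python
import hashlib
import re

NUM_PERM = 128             # signature length (assumed)
BANDS = 32                 # 32 bands x 4 rows each (assumed tuning for ~0.8)
ROWS = NUM_PERM // BANDS
THRESHOLD = 0.8            # Jaccard threshold from the report


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(shingle_set):
    """For each of NUM_PERM salted hashes, keep the minimum over the set."""
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(2, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(chunks):
    """Keep the first chunk of each near-duplicate group, preserving order."""
    buckets = {}       # (band index, band values) -> indices of kept chunks
    signatures = {}    # index of kept chunk -> its signature
    kept = []
    for idx, chunk in enumerate(chunks):
        sig = minhash(shingles(chunk))
        keys = [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]
        # Only chunks sharing at least one band are compared in full.
        candidates = {j for key in keys for j in buckets.get(key, ())}
        if any(estimated_jaccard(sig, signatures[j]) >= THRESHOLD
               for j in candidates):
            continue
        signatures[idx] = sig
        kept.append(chunk)
        for key in keys:
            buckets.setdefault(key, []).append(idx)
    return kept
```

Calling `dedupe(retrieved_chunks)` before prompt assembly drops later copies of a repeated passage and keeps retrieval order. The similarity is an estimate, so chunks close to the threshold can fall on either side.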
Journey Context:
RAG systems often retrieve 10-20 chunks blindly, causing repetition \('The company was founded in 2010' appears in 3 chunks\) and filler bloat. The longer context triggers the 'lost in the middle' problem, where the LLM misses key facts buried in the middle of a long prompt. MinHash LSH deduplication scales roughly as O\(n\) in the number of chunks, instead of the O\(n^2\) of all-pairs comparison, and catches near-duplicates such as those from overlapping PDF pages. Semantic compression uses a cheap model to extract only the sentences relevant to the query, so the expensive model sees mostly signal.
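The compression step can be sketched as a router that leaves short chunks alone and passes long ones to a summarizer. The 512 and 128 token limits come from this report; everything else here is an assumption made for illustration. In particular, `count_tokens` is a crude whitespace count standing in for a real tokenizer, and `keyword_extract` is a hypothetical local stand-in for the cheap-model call, which in practice would be a request to Haiku or Flash.

```python
import re
from typing import Callable, List

MAX_CHUNK_TOKENS = 512   # chunks above this size get compressed (from the report)
TARGET_TOKENS = 128      # budget for each extract (from the report)


def count_tokens(text: str) -> int:
    """Crude whitespace count; an assumption, not a real model tokenizer."""
    return len(text.split())


def keyword_extract(query: str, chunk: str, budget: int) -> str:
    """Hypothetical stand-in for the cheap model: keep sentences that share
    at least one word with the query. It ignores the budget; the caller
    enforces it."""
    terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    hits = [s for s in sentences
            if terms & set(re.findall(r"\w+", s.lower()))]
    return " ".join(hits)


def compress_chunks(
    query: str,
    chunks: List[str],
    summarize: Callable[[str, str, int], str],
    max_tokens: int = MAX_CHUNK_TOKENS,
    target_tokens: int = TARGET_TOKENS,
) -> List[str]:
    """Pass short chunks through unchanged; condense long ones."""
    result = []
    for chunk in chunks:
        if count_tokens(chunk) <= max_tokens:
            result.append(chunk)
            continue
        extract = summarize(query, chunk, target_tokens)
        # Hard cap in case the summarizer overshoots the budget.
        result.append(" ".join(extract.split()[:target_tokens]))
    return result
```

Passing the summarizer as a callable keeps the routing logic testable without network access; swapping in a real model call only requires a function with the same three arguments. Running deduplication first avoids paying to summarize chunks that would be dropped anyway.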
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:31:03.889346+00:00— report_created — created