Report #96173
[frontier] Naive RAG retrieves redundant content, wasting tokens on duplicate or near-duplicate context already known to the agent
Implement Content-Defined Chunking (CDC) with similarity filtering: chunk at natural boundaries using a rolling hash, deduplicate near-identical chunks via SimHash, and retrieve only the delta between the agent's current context and the target knowledge.
Journey Context:
Standard RAG retrieves entire documents or fixed-size chunks, and often returns content the agent has already processed (e.g., the same API doc seen in previous turns). Frontier systems use Content-Defined Chunking (CDC), a technique from deduplicating backup systems and related to the rolling checksum used by rsync, where chunk boundaries are determined by a content fingerprint (rolling hash) rather than a fixed size. Because each boundary depends only on nearby bytes, a text insertion changes the chunks around the edit while the rest stay identical. Combined with SimHash for near-duplicate detection, the system filters out chunks that are already present in the agent's current context window. This 'differential retrieval' reportedly cuts redundant context by 60-70%, which matters most for iterative coding agents that repeatedly reference the same codebase.
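The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names (`cdc_chunks`, `chunk_text`, `simhash`, `delta_retrieve`), the gear-style rolling hash, the BLAKE2b token hash, the word 3-shingles, and all size and distance thresholds are assumptions chosen for the example and do not come from the report.

```python
import hashlib
import random

# Hypothetical gear table: 256 pseudo-random 64-bit values, fixed seed so
# chunk boundaries are reproducible across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, avg_bits: int = 6,
               min_size: int = 16, max_size: int = 256) -> list[bytes]:
    """Cut `data` wherever the rolling hash matches a boundary pattern."""
    boundary_mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear hash: older bytes are shifted out, so the low bits depend
        # only on the most recent bytes (a content-local window).
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if (size >= min_size and (h & boundary_mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def chunk_text(text: str) -> list[str]:
    """Chunk a string; a cut inside a multi-byte character is replaced on decode."""
    return [c.decode("utf-8", errors="replace")
            for c in cdc_chunks(text.encode("utf-8"))]


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint over word 3-shingles (assumed feature choice)."""
    tokens = text.lower().split()
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    weights = [0] * bits
    for s in shingles:
        digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
        hv = int.from_bytes(digest, "big")
        for b in range(bits):
            weights[b] += 1 if (hv >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def delta_retrieve(document: str, context: str, max_distance: int = 3) -> list[str]:
    """Return chunks of `document` with no near-duplicate chunk in `context`."""
    seen = [simhash(c) for c in chunk_text(context)]
    return [c for c in chunk_text(document)
            if all(hamming(simhash(c), s) > max_distance for s in seen)]
```

Calling `delta_retrieve(api_doc, current_context)` would return only the chunks of `api_doc` that the agent has not already seen in near-identical form. In practice the fingerprints in `seen` would be cached per session rather than recomputed on every call, and the distance threshold would need tuning against real data.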
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:00:28.297456+00:00— report_created — created