Report #26193
[frontier] How to improve retrieval accuracy when chunks lose context from parent documents?
Use Contextual Retrieval: have an LLM generate a short, chunk-specific explanatory context for each text chunk, prepend it to the chunk before embedding, and store a parent document reference with every chunk. Pair this with 'parent document' storage so the full context can be fetched after similarity search. The price is extra work at indexing time (roughly 2x tokens, mostly for context generation), accepted in exchange for significant accuracy gains.
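A minimal sketch of the 'parent document' storage idea, under stated assumptions: `ChunkRecord`, `index_chunk`, and `retrieve_parents` are hypothetical names, and `toy_embed` is a word-count stand-in for a real embedding model so the example runs without any external service. The context strings are written by hand here; in practice an LLM would generate them.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    parent_id: str  # reference back to the source document
    text: str       # original chunk text, stored without the generated context
    vector: dict    # embedding of "context + chunk"


def toy_embed(text: str) -> dict:
    """Stand-in for a real embedding model: L2-normalised word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: dict, b: dict) -> float:
    """Dot product of two already-normalised sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def index_chunk(parent_id: str, chunk: str, context: str) -> ChunkRecord:
    """Embed context + chunk, but keep only the original chunk text."""
    return ChunkRecord(parent_id, chunk, toy_embed(f"{context} {chunk}"))


def retrieve_parents(query: str, records: list, parents: dict, k: int = 1) -> list:
    """Rank chunks by similarity, then return the top-k distinct parent documents."""
    q = toy_embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, r.vector), reverse=True)
    parent_ids = []
    for record in ranked:
        if record.parent_id not in parent_ids:
            parent_ids.append(record.parent_id)
        if len(parent_ids) == k:
            break
    return [parents[pid] for pid in parent_ids]


# Two parent documents whose chunks are nearly identical without context.
parents = {
    "privacy": "Privacy Policy of Acme. The policy was enacted in 2023.",
    "returns": "Returns Policy of Acme. The policy was enacted in 2021.",
}
records = [
    index_chunk("privacy", "The policy was enacted in 2023.",
                "From the Privacy Policy of Acme."),
    index_chunk("returns", "The policy was enacted in 2021.",
                "From the Returns Policy of Acme."),
]
top = retrieve_parents("When was the privacy policy enacted?", records, parents)
```

Only the prepended context separates the two chunks at query time, which is why the vector is built from context plus chunk while the stored `text` stays clean and `parent_id` leads back to the full document.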
Journey Context:
Standard chunking (fixed-size windows with overlap) severs semantic connections: a chunk such as 'The policy was enacted in 2023' no longer says what 'The policy' refers to. Contextual Retrieval (the pattern Anthropic published in September 2024) uses an LLM to generate context for each chunk, for example: 'This chunk discusses the Privacy Policy enacted in 2023; it follows sections on Terms of Service'. This context is prepended to the chunk text before embedding; the stored record keeps the original chunk text without the context. During retrieval, the parent document or full chunk is fetched. Tradeoff: higher upfront indexing cost (roughly 2x tokens, driven by context generation) in return for better retrieval. Anthropic reported a 35% reduction in top-20 retrieval failure rate from contextual embeddings alone, and 49% when they are combined with contextual BM25. Common mistake: generating context that is too generic ('This is a text chunk') or not specific to the chunk. Alternatives such as hierarchical summaries or ColBERT reranking address different parts of the problem.
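The context-generation step and the 'too generic' pitfall can be sketched as follows. Everything here is an assumption for illustration: `CONTEXT_PROMPT` is only loosely modelled on the kind of prompt used for this pattern, `generate` is a hypothetical callable wrapping whatever LLM client is in use, and `adds_information` is a crude heuristic, not an established quality metric.

```python
import re
from typing import Callable

# Hypothetical prompt template; adjust the wording for your model.
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Write one or two sentences that situate this chunk within the document "
    "to improve search retrieval. Answer with the context only."
)

# Words that carry no document-specific information (illustrative list).
STOPWORDS = {"this", "is", "a", "an", "the", "of", "in", "on", "it", "and",
             "to", "was", "text", "chunk", "section", "document"}


def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def adds_information(context: str, chunk: str, document: str,
                     min_new_terms: int = 1) -> bool:
    """Heuristic: useful context brings in document terms the chunk lacks."""
    new_terms = (_words(context) & _words(document)) - _words(chunk) - STOPWORDS
    return len(new_terms) >= min_new_terms


def contextualize(document: str, chunk: str,
                  generate: Callable[[str], str]) -> str:
    """Return 'context + chunk', the text that should be embedded."""
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    context = generate(prompt).strip()
    if not adds_information(context, chunk, document):
        raise ValueError(f"generated context is too generic: {context!r}")
    return f"{context}\n\n{chunk}"


# Usage with a fake LLM so the sketch runs offline.
document = ("Privacy Policy. Acme collects usage data. "
            "The policy was enacted in 2023. Terms of Service follow.")
chunk = "The policy was enacted in 2023."


def fake_llm(prompt: str) -> str:
    return "This chunk discusses the Privacy Policy enacted in 2023."


text_to_embed = contextualize(document, chunk, fake_llm)
```

Rejecting generic output at indexing time is a design choice of this sketch; retrying with a stricter prompt or falling back to the bare chunk would be equally reasonable.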
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:22:02.238199+00:00— report_created — created