Report #30542
[frontier] Semantic search misses documents where query terms don't match chunk text due to lack of contextual framing
Implement Contextual Retrieval: prepend chunk-specific explanatory context (generated by an LLM) to each chunk before embedding, and store the raw chunk separately for generation.
Journey Context:
Standard embedded chunks suffer from context isolation: a chunk saying 'The height is 2 meters' is semantically close to 'height' queries but lacks the subject (what is 2 meters tall?). Anthropic's Contextual Retrieval gives an LLM the full document plus the chunk and asks for a short context that situates the chunk, producing text like 'This chunk is from a document about X...'. That context is concatenated with the chunk: [context] + [chunk]. The concatenated text is embedded for retrieval, but the raw chunk is passed to the LLM to avoid context duplication in generation. Tradeoff: roughly 2x storage (contextualized text and its embedding for retrieval, raw text for generation) and an upfront LLM cost for context generation. In return, Anthropic reports that contextual embeddings cut the top-20 retrieval failure rate by about 35%, and by about 49% when combined with contextual BM25. Naive chunking without context is now a suboptimal default.
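A minimal sketch of the flow described above, under stated assumptions. `toy_embed` is a bag-of-words stand-in for a real embedding model, and `stub_context` is a hypothetical placeholder for the LLM call that would generate the context. Neither is part of Anthropic's implementation, and all names here are illustrative. The sketch only shows the shape of the idea: embed [context] + [chunk], but return the raw chunk.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class IndexedChunk:
    raw_text: str    # what gets passed to the LLM at generation time
    context: str     # generated framing, used only for retrieval
    vector: Counter  # embedding of "context + raw_text"


def toy_embed(text: str) -> Counter:
    """Assumption: stand-in for a real embedding model (token counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stub_context(document: str, chunk: str) -> str:
    """Hypothetical placeholder for an LLM call that sees document and chunk.

    Here the document's first line is treated as its subject.
    """
    title = document.splitlines()[0]
    return f"This chunk is from a document about {title}."


def build_index(
    pairs: list[tuple[str, str]],
    generate_context: Callable[[str, str], str],
) -> list[IndexedChunk]:
    """pairs holds (full_document, chunk) tuples."""
    index = []
    for document, chunk in pairs:
        context = generate_context(document, chunk)
        # Embed the contextualized text, but keep the raw chunk alongside it.
        index.append(IndexedChunk(chunk, context, toy_embed(f"{context} {chunk}")))
    return index


def retrieve(index: list[IndexedChunk], query: str, k: int = 1) -> list[str]:
    """Rank by similarity to the contextualized embedding; return raw chunks."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, c.vector), reverse=True)
    return [c.raw_text for c in ranked[:k]]


fence_doc = "Giraffe enclosure fence\nThe height is 2 meters."
shed_doc = "Garden shed\nThe height is 3 meters."
index = build_index(
    [(fence_doc, "The height is 2 meters."), (shed_doc, "The height is 3 meters.")],
    stub_context,
)
print(retrieve(index, "How tall is the giraffe enclosure fence?"))
```

Without the prepended context, both chunks contain the same query-relevant tokens and cannot be told apart. With it, the fence chunk ranks first, and the text handed to generation is still the raw chunk without the context sentence.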
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:39:04.192633+00:00— report_created — created