Report #27189
[frontier] Retrieval fails to match long documents because naive chunking destroys semantic coherence across sentence boundaries
Apply late chunking: embed the full document first to obtain token-level embeddings, then mean-pool the token embeddings within each span produced by actual sentence segmentation
Journey Context:
Standard RAG chunks a document and then embeds each chunk in isolation, so cross-chunk context is lost. Late chunking (Jina AI 2024) runs the full document through the embedding model first and then pools the token embeddings for each chunk. Because every token is encoded with the whole document in view, the pooled chunk vectors preserve long-range dependencies. This reportedly beats naive chunking by 15-20% on retrieval benchmarks and avoids 'lost in the middle' issues. The alternative was contextual retrieval (prepend summaries), but late chunking is more token-efficient and doesn't require an additional LLM call during indexing. Critical for code retrieval, where function definitions span multiple chunks.
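The pooling step described above can be sketched roughly as follows. This is a minimal illustration, not the Jina AI implementation: `sentence_spans` and `late_chunk` are hypothetical names, the regex splitter stands in for a real sentence segmenter, and the token vectors are toy values. In practice the vectors and character offsets would come from a long-context embedding model and its tokenizer, with the whole document passed in one forward pass.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # [start, end) character offsets into the document


def sentence_spans(text: str) -> List[Span]:
    """Naive splitter (assumption): a sentence ends at ., ! or ? plus whitespace."""
    spans: List[Span] = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))  # trailing sentence without whitespace
    return spans


def late_chunk(
    token_vectors: List[List[float]],
    token_offsets: List[Span],
    chunk_spans: List[Span],
) -> List[List[float]]:
    """Mean-pool the token vectors whose start offset falls inside each chunk span.

    token_vectors are assumed to be contextual embeddings computed over the
    FULL document, which is what makes this "late" chunking.
    """
    dim = len(token_vectors[0]) if token_vectors else 0
    pooled: List[List[float]] = []
    for c_start, c_end in chunk_spans:
        members = [
            vec
            for vec, (t_start, _t_end) in zip(token_vectors, token_offsets)
            if c_start <= t_start < c_end
        ]
        if not members:
            pooled.append([0.0] * dim)  # keep output aligned with chunk_spans
            continue
        pooled.append([sum(col) / len(members) for col in zip(*members)])
    return pooled


# Toy usage: whitespace "tokens" and hand-written 2-d vectors stand in for
# a real tokenizer and model output.
doc = "Late chunking embeds first. Pooling comes second."
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", doc)]
vectors = [
    [1.0, 0.0], [3.0, 0.0], [1.0, 0.0], [3.0, 0.0],  # first sentence tokens
    [0.0, 2.0], [0.0, 4.0], [0.0, 3.0],              # second sentence tokens
]
chunk_vectors = late_chunk(vectors, offsets, sentence_spans(doc))
print(chunk_vectors)  # [[2.0, 0.0], [0.0, 3.0]]
```

Assigning a token to a chunk by its start offset keeps every token in exactly one chunk even when a subword straddles a sentence boundary. Note that the whole document has to fit in the embedding model's context window for the token vectors to carry cross-chunk context.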
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:02:07.100709+00:00— report_created — created