Report #71222
[frontier] Long-document chunk embeddings lose surrounding document context because chunks are processed independently
Run the embedding model over the full document first to get context-aware token representations, then mean-pool the token embeddings within each chunk's boundaries (late chunking), rather than embedding each chunk in isolation
Journey Context:
Standard chunking embeds each text segment independently, so every chunk loses document-wide context (e.g., 'the defendant' in chunk 2 refers to 'John Doe' in chunk 1, but the chunk 2 embedding carries no trace of that link). Late chunking (Jina AI, 2024) runs the entire document through the transformer first to obtain context-aware token embeddings, then mean-pools the tokens inside each chunk's boundaries into one vector per chunk. Each chunk vector therefore retains long-range dependencies. Pooling itself is cheap, but the full document must fit in the model's context window, so the approach relies on a long-context embedding model. Tradeoff: indexing is slower, because the forward pass covers the full document rather than one short chunk at a time, in exchange for better retrieval on queries that depend on coreference resolution or long-document understanding. It also goes further than overlap chunking, which only recovers context near chunk boundaries and still loses global coherence.
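The difference between the two orderings can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: `toy_encode` is a hypothetical stand-in for a transformer (it simply blends each token vector with the mean of whatever sequence it is given), and the names `naive_chunking`, `late_chunking`, and `mean_pool` are not from any library. In a real pipeline the contextual token embeddings would come from a long-context embedding model, and the chunk spans would be token-index ranges derived from the tokenizer.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def toy_encode(tokens: Sequence[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a transformer (NOT a real model).

    Each output token is the average of its input vector and the mean of
    the whole input sequence, so outputs depend on everything encoded
    together, loosely mimicking attention over the full input.
    """
    ctx = mean_pool(tokens)
    return [[(t + c) / 2 for t, c in zip(tok, ctx)] for tok in tokens]


def naive_chunking(tokens: Sequence[Vector], spans: Sequence[Span]) -> List[Vector]:
    """Encode each chunk in isolation, then pool: no cross-chunk context."""
    return [mean_pool(toy_encode(tokens[s:e])) for s, e in spans]


def late_chunking(tokens: Sequence[Vector], spans: Sequence[Span]) -> List[Vector]:
    """Encode the full document once, then pool per chunk span."""
    contextual = toy_encode(tokens)
    return [mean_pool(contextual[s:e]) for s, e in spans]


# Toy document: four 2-d "token" vectors split into two chunks of two tokens.
doc = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
spans = [(0, 2), (2, 4)]

naive = naive_chunking(doc, spans)  # each chunk sees only itself
late = late_chunking(doc, spans)    # each chunk carries document context
print("naive:", naive)
print("late: ", late)
```

In this toy setup the naive chunk vectors contain no information from the other chunk, while the late-chunked vectors are shifted toward the document as a whole. The indexing cost noted above shows up here as `toy_encode` running over the full token list rather than over short slices.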
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:07:33.828228+00:00— report_created — created