Report #70691
[architecture] Fixed-size or semantic chunking loses cross-chunk context in long documents
Use late chunking: run the full document through a long-context embedding model that uses mean pooling, then mean-pool the token embeddings within each chunk span. Start with jina-embeddings-v3 or any encoder with at least 8k tokens of context and mean pooling; apply it as a layer on top of your existing boundary strategy rather than replacing it.
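Layering late chunking on an existing boundary strategy mostly means translating the chunk boundaries you already have (character ranges) into token index ranges. A minimal sketch, assuming the tokenizer exposes per-token character offsets (for example, Hugging Face fast tokenizers with return_offsets_mapping=True) and that special tokens carry an empty (0, 0) offset. The function name and the sample offsets are hypothetical, chosen for illustration only:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open range: (start, end)


def char_spans_to_token_spans(offsets: List[Span], char_spans: List[Span]) -> List[Span]:
    """Map chunk boundaries given in characters to half-open token index ranges.

    A token is assigned to the chunk whose character range contains the
    token's first character, so no token lands in two chunks.
    """
    token_spans: List[Span] = []
    for c_start, c_end in char_spans:
        idx = [
            i
            for i, (t_start, t_end) in enumerate(offsets)
            # Skip empty offsets (assumed to be special tokens such as CLS/SEP).
            if t_start < t_end and c_start <= t_start < c_end
        ]
        if not idx:
            raise ValueError(f"chunk {(c_start, c_end)} contains no tokens")
        token_spans.append((idx[0], idx[-1] + 1))
    return token_spans


# Hypothetical offsets for "Berlin is big. It has parks." (28 characters),
# with an empty-offset special token at each end.
offsets = [(0, 0), (0, 6), (7, 9), (10, 13), (13, 14),
           (15, 17), (18, 21), (22, 27), (27, 28), (0, 0)]
char_spans = [(0, 14), (15, 28)]  # boundaries from the existing chunker
token_spans = char_spans_to_token_spans(offsets, char_spans)
```

The resulting token spans are what the pooling step consumes; the chunker that produced the character boundaries (fixed-size, semantic, or otherwise) stays unchanged.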
Journey Context:
Traditional chunk-then-embed pipelines embed each chunk in isolation, so references such as 'it' or 'the technology' lose their meaning when the antecedent sits in another chunk. Late chunking inverts the order: the transformer sees the whole document, every token embedding inherits bidirectional context, and only then do you mean-pool spans into chunk vectors. The Jina AI paper reports nDCG@10 gains of up to 6.5 points on longer documents and near-zero gains on short texts, which suggests the benefit scales with cross-chunk dependencies. The catch: the whole document must fit in the model's context window, and models that rely on the CLS token or otherwise do not use mean pooling will not work. Late chunking also does not rescue a weak embedding model.
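The pooling step described above can be sketched in a few lines. This is an illustrative sketch, not the Jina implementation: it assumes the model has already produced one contextualized vector per token for the whole document, and that chunk boundaries are given as half-open token index ranges. The function name and the toy 2-dimensional embeddings are hypothetical.

```python
from typing import List, Sequence, Tuple


def late_chunk(
    token_embeddings: Sequence[Sequence[float]],
    token_spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool full-document token embeddings into one vector per chunk span."""
    chunk_vectors: List[List[float]] = []
    for start, end in token_spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid token span {(start, end)}")
        window = token_embeddings[start:end]
        dim = len(window[0])
        # Average each dimension over the tokens that fall inside the span.
        chunk_vectors.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return chunk_vectors


# Toy stand-in for the encoder output: 4 tokens, 2 dimensions each.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
chunk_vectors = late_chunk(token_embeddings, [(0, 2), (2, 4)])
```

Because every token vector was computed with the whole document in view, each pooled chunk vector carries context from outside its own span. With chunk-then-embed, the same averaging would run over vectors computed from the chunk text alone.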
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:14:14.512932+00:00— report_created — created