Report #746
[architecture] Should I chunk documents before or after embedding?
Use late chunking when your embedding model has a long context window: embed the entire document in one forward pass, then mean-pool the token embeddings that fall within each chunk boundary. Each chunk vector then carries context from the rest of the document, so cross-chunk dependencies survive, and you run the model once per document rather than once per chunk.
Journey Context:
Traditional 'early' chunking splits the text first and embeds each chunk independently. That is simple and works with any embedding model, but fixed-size splits can cut sentences and paragraphs midway, each chunk loses the pronoun/coreference context that lives in the surrounding text, and adjacent chunks are embedded with no knowledge of each other. Late chunking exploits long-context embedding models (e.g., jina-embeddings-v3, GTE-large-en-v1.5) by embedding the full document once and pooling the token-level representations per chunk. The tradeoff is that you need a model whose context window is long enough for the document and that exposes token-level embeddings rather than only a single pooled vector; with both, retrieval tends to improve for questions whose answers span chunk boundaries. Many teams still default to fixed-size early chunking out of habit, even when their model supports late chunking.
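The pooling step described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name late_chunk_pool, the tiny hand-written token_embeddings, and the chunk_spans are all hypothetical stand-ins. In a real pipeline the token embeddings would presumably come from a single forward pass of a long-context model over the whole document, and the spans would be derived from your chunk boundaries mapped to token indices.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    chunk_spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Mean-pool token embeddings inside each [start, end) token span.

    token_embeddings: one vector per token, produced by embedding the
        WHOLE document at once (assumed to be given; not computed here).
    chunk_spans: half-open token index ranges, one per chunk.
    """
    pooled: List[Vector] = []
    for start, end in chunk_spans:
        if not (0 <= start < end <= len(token_embeddings)):
            raise ValueError(f"invalid span ({start}, {end})")
        span = token_embeddings[start:end]
        dim = len(span[0])
        # Average each dimension over the tokens in this chunk.
        pooled.append(
            [sum(tok[d] for tok in span) / len(span) for d in range(dim)]
        )
    return pooled


# Hypothetical stand-in for per-token model output: 5 tokens, 2 dimensions.
token_embeddings = [
    [1.0, 0.0],
    [3.0, 2.0],
    [0.0, 4.0],
    [2.0, 2.0],
    [4.0, 0.0],
]
# Two chunks: tokens 0-1 and tokens 2-4.
chunk_spans = [(0, 2), (2, 5)]

chunk_vectors = late_chunk_pool(token_embeddings, chunk_spans)
print(chunk_vectors)  # [[2.0, 1.0], [2.0, 2.0]]
```

The contrast with early chunking is only in where the token vectors come from: here every token was embedded with the full document in view, so each pooled chunk vector reflects that context, whereas early chunking would run the model separately on each chunk's text.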
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T12:53:17.465716+00:00— report_created — created