Report #100231
[architecture] How should I chunk long documents for dense retrieval without losing cross-sentence context?
Use late chunking with a long-context embedding model: encode the full document (or the largest span that fits in the context window) once, then mean-pool the token embeddings within each chunk's boundaries. Prefer sentence or paragraph boundaries, and evaluate with nDCG@10 on your own retrieval benchmark.
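For the evaluation step, a minimal sketch of nDCG@10 for a single query might look like the following. It assumes linear gain (the graded relevance itself, not 2^rel - 1) and a hypothetical `relevance_by_id` dict holding your own benchmark's judgments; the function names are illustrative, not from any library.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevances, in rank order."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, relevance_by_id, k=10):
    """nDCG@k for one query.

    ranked_ids: chunk or document ids in the order the retriever returned them.
    relevance_by_id: id -> graded relevance; ids missing from the dict count as 0.
    """
    gains = [relevance_by_id.get(doc_id, 0) for doc_id in ranked_ids]
    # The ideal ranking puts the most relevant judged items first.
    ideal = sorted(relevance_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Averaging `ndcg_at_k` over all benchmark queries gives one number per chunking strategy, which makes naive and late chunking directly comparable on the same data.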
Journey Context:
Naive fixed-size chunking embeds each chunk independently, so a chunk that mentions 'Berlin' loses the surrounding article context and ranks worse at retrieval time. Late chunking keeps self-attention over the whole window before pooling, so each chunk embedding is contextualized by the rest of the text. It costs one long forward pass per window and needs a model with a large context window (e.g., jina-embeddings-v2, nomic-embed-text). Boundary choice still matters; sentence-aware boundaries beat fixed boundaries in the paper. If the model was trained with mean pooling over whole documents, late chunking works out of the box; for maximum gain, fine-tune with span pooling.
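The pooling step can be sketched in plain Python, independent of any particular model. This is an illustration under stated assumptions, not a reference implementation: `token_embeddings` stands for the per-token output of one long forward pass, and `token_offsets` for per-token character offsets of the kind a tokenizer with offset mapping provides. All function names here are hypothetical.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def char_spans_to_token_spans(token_offsets, chunk_char_spans):
    """Map chunk (char_start, char_end) spans to (tok_start, tok_end) spans.

    token_offsets: one (char_start, char_end) pair per token (assumed input).
    A token is assigned to the chunk that contains its first character.
    Zero-width offsets, which some tokenizers emit for special tokens, are skipped.
    Chunks that contain no tokens are dropped from the result.
    """
    spans = []
    for c_start, c_end in chunk_char_spans:
        idx = [
            i for i, (t_start, t_end) in enumerate(token_offsets)
            if t_end > t_start and c_start <= t_start < c_end
        ]
        if idx:
            spans.append((idx[0], idx[-1] + 1))
    return spans

def late_chunk(token_embeddings, token_spans):
    """Pool already-contextualized token embeddings into one vector per chunk."""
    return [mean_pool(token_embeddings[start:end]) for start, end in token_spans]
```

The key design point is that `late_chunk` runs after the model has attended over the whole window, so the chunk boundaries only decide which contextualized token vectors get averaged together. Sentence or paragraph boundaries would be supplied as `chunk_char_spans` from whatever splitter you already use.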
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:52:56.517895+00:00— report_created — created