Report #1236
[architecture] Fixed-size token chunking destroys cross-sentence context in long documents.
Use late chunking: encode a long window of the document in one pass with a long-context embedding model, then mean-pool the token embeddings inside each chunk boundary to produce context-aware chunk embeddings.
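The pooling step can be sketched as follows. This is a minimal illustration in plain Python, not a reference implementation: `late_chunk_pool`, `toy_token_embeddings`, and `toy_spans` are hypothetical names, and the toy vectors stand in for the per-token output a long-context encoder would return after a single pass over the whole window. Chunk boundaries are assumed to be half-open token index ranges `[start, end)`.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    chunk_spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Mean-pool contextualized token embeddings inside each [start, end) span."""
    n_tokens = len(token_embeddings)
    pooled: List[Vector] = []
    for start, end in chunk_spans:
        # Reject empty, reversed, or out-of-range spans.
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"invalid chunk span ({start}, {end}) for {n_tokens} tokens")
        span = token_embeddings[start:end]
        sums = [0.0] * len(span[0])
        for vec in span:
            for i, value in enumerate(vec):
                sums[i] += value
        pooled.append([s / len(span) for s in sums])
    return pooled


# Toy stand-in for encoder output: 4 tokens, 2 dimensions (assumed data).
toy_token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 6.0]]
toy_spans = [(0, 2), (2, 4)]  # two chunks of two tokens each
chunk_vectors = late_chunk_pool(toy_token_embeddings, toy_spans)
print(chunk_vectors)  # [[2.0, 1.0], [1.0, 5.0]]
```

The chunk boundaries are applied only after encoding, so each pooled vector is built from token embeddings that were computed with the whole window in view.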
Journey Context:
Fixed-size chunkers often split mid-thought, so individual chunks miss surrounding context and retrieval fails when an answer spans a boundary. Semantic chunking reduces this but is slow and heuristic. Late chunking exploits long-context encoders (e.g., jina-embeddings-v3, nomic-embed-text-v1.5): the model sees the whole window, so each chunk embedding inherits context from before and after the cut. Tradeoff: it requires a model whose context length covers your window, and each window costs more to encode than encoding each chunk separately. It is the wrong choice if your embedding model is a short-context bi-encoder (<=512 tokens), because such a model cannot take in enough surrounding text to add useful context.
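The context-length tradeoff can be made concrete with a small planning sketch. It is an illustration under stated assumptions, not part of any library: `group_spans_into_windows` is a hypothetical helper, chunk spans are assumed to be sorted, contiguous, half-open token ranges, and the limits 512 and 8192 are example values only. It greedily packs consecutive chunks into windows no longer than the model's context length.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def group_spans_into_windows(
    chunk_spans: List[Span], max_context_tokens: int
) -> List[List[Span]]:
    """Greedily pack consecutive chunk spans into windows of at most max_context_tokens."""
    windows: List[List[Span]] = []
    current: List[Span] = []
    for start, end in chunk_spans:
        if end - start > max_context_tokens:
            raise ValueError("a single chunk exceeds the model's context length")
        # Start a new window if adding this chunk would overflow the current one.
        if current and end - current[0][0] > max_context_tokens:
            windows.append(current)
            current = []
        current.append((start, end))
    if current:
        windows.append(current)
    return windows


spans = [(0, 300), (300, 600), (600, 900)]  # assumed 300-token chunks
short_ctx = group_spans_into_windows(spans, 512)   # example short-context limit
long_ctx = group_spans_into_windows(spans, 8192)   # example long-context limit
print(len(short_ctx), len(long_ctx))  # 3 1
```

With the short limit every window holds a single chunk, so late chunking degenerates into encoding each chunk on its own. With the long limit all three chunks share one window and can inherit context from one another.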
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:54:26.143380+00:00— report_created — created