Report #669
[architecture] Naive chunk-then-embed loses cross-chunk context and anaphoric references
Use late chunking: run the whole document through a long-context embedding model, then mean-pool the token embeddings inside each chunk span to get the chunk vectors. Chunking happens after the transformer, just before mean pooling.
Journey Context:
Standard chunking splits first and embeds each piece in isolation, so a chunk containing 'its revenue grew 3%' cannot resolve 'its' or know which quarter is meant. Late chunking keeps the full document in the self-attention window, so every token embedding is conditioned on the whole document; the chunk boundaries are then applied to the token embedding sequence rather than to the raw text. It outperforms naive chunking on long-document benchmarks (e.g., NFCorpus, SciFact) and requires no extra LLM calls. The tradeoff is that you need a long-context embedder (8k+ tokens, e.g., Jina v3, Voyage 3, Cohere Embed v4) and must align token spans to character spans; it also adds a single full-document forward pass at index time. It does not help when the document is mostly unrelated filler with a single needle, because the extra context becomes noise.
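A minimal sketch of the pooling step, assuming the full-document forward pass has already produced one contextualized embedding and one character offset per token. The function name late_chunk_pool, the start-offset assignment rule, and the toy vectors are illustrative assumptions, not any library's API; a real pipeline would take the embeddings and offsets from the model and its tokenizer.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    token_offsets: Sequence[Span],
    chunk_spans: Sequence[Span],
) -> List[Vector]:
    """Mean-pool contextualized token embeddings into one vector per chunk.

    Hypothetical helper: inputs are assumed to come from a single
    full-document forward pass, which is not shown here.
    """
    if len(token_embeddings) != len(token_offsets):
        raise ValueError("need exactly one character offset per token embedding")
    dim = len(token_embeddings[0]) if token_embeddings else 0
    chunk_vectors: List[Vector] = []
    for chunk_start, chunk_end in chunk_spans:
        total = [0.0] * dim
        count = 0
        for emb, (tok_start, tok_end) in zip(token_embeddings, token_offsets):
            # Assumption: a token belongs to the chunk its first character falls in.
            # Zero-width offsets (commonly special tokens) are skipped.
            if tok_end > tok_start and chunk_start <= tok_start < chunk_end:
                for i, value in enumerate(emb):
                    total[i] += value
                count += 1
        # A chunk with no tokens yields a zero vector instead of dividing by zero.
        chunk_vectors.append([v / count for v in total] if count else total)
    return chunk_vectors


# Toy stand-in for model output: 7 tokens, 2-dimensional embeddings.
text = "Acme reported Q2. Its revenue grew 3%."
token_offsets = [(0, 4), (5, 13), (14, 17), (18, 21), (22, 29), (30, 34), (35, 38)]
token_embeddings = [
    [1.0, 0.0], [3.0, 0.0], [2.0, 0.0],              # "Acme reported Q2."
    [0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [0.0, 8.0],  # "Its revenue grew 3%."
]
chunk_spans = [(0, 17), (18, 38)]  # character spans of the two sentences

chunk_vectors = late_chunk_pool(token_embeddings, token_offsets, chunk_spans)
print(chunk_vectors)
```

The toy vectors only exercise the span alignment and averaging. In a real run, the embeddings for 'Its revenue grew 3%.' would already carry information about 'Acme' and 'Q2', because they were computed with the whole document in the attention window.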
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T11:52:36.063090+00:00— report_created — created