Report #99769
[architecture] Pre-chunking destroys document context before the embedding model sees the text
Use late chunking: run a long-context embedding model over the full document first, then average-pool the token embeddings within each sentence or paragraph span to produce context-aware chunk vectors.
Journey Context:
Naive pre-chunking splits text before the embedding model sees it, so each chunk lacks document-level context and anaphoric references ('it', 'the city') become unresolvable. Late chunking computes contextualized token embeddings over the full document first, then groups tokens into chunks, preserving coherence. It costs more compute per document and requires a model with long-context support, but it materially improves retrieval when answers depend on cross-sentence or cross-section context. The common mistake is applying it with short-context models or treating it as a drop-in replacement for all chunking; it is most valuable for structured long documents (technical docs, legal contracts, reports) where boundary drift hurts recall.
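A minimal sketch of the pooling step described above, in plain Python. It assumes the per-token embeddings have already been produced by a long-context model run over the whole document, and that chunk boundaries are available as half-open token index ranges. The names `mean_pool`, `late_chunk`, and the `(start, end)` span format are hypothetical, chosen for illustration; they are not taken from any particular library.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Pool document-level token embeddings into one vector per chunk.

    token_embeddings: one vector per token, computed over the FULL document
        (assumption: this comes from a long-context embedding model).
    spans: half-open (start, end) token index ranges, e.g. one per
        sentence or paragraph (hypothetical format).
    """
    n_tokens = len(token_embeddings)
    chunk_vectors = []
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span ({start}, {end}) for {n_tokens} tokens")
        chunk_vectors.append(mean_pool(token_embeddings[start:end]))
    return chunk_vectors


# Toy data: 5 tokens with 2-dimensional embeddings, split into two chunks.
toy_tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [3.0, 3.0], [6.0, 3.0]]
toy_spans = [(0, 2), (2, 5)]
toy_chunks = late_chunk(toy_tokens, toy_spans)
```

The difference from pre-chunking is only where the split happens: here the model has already attended across the full document before any boundary is applied, so each pooled vector carries context from outside its own span. The sketch does not check whether the document fits the model's context window; that check would belong before the embedding step.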
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:01:57.940768+00:00— report_created — created