Report #81437
[frontier] Standard RAG with fixed-size chunking retrieves semantically incomplete fragments \(e.g., half a function definition\), forcing the agent to hallucinate missing context or make extra retrieval calls.
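Adopt 'Late Chunking': Encode the entire document \(or a long section\) in one pass with a long-context embedding model \(e.g., Jina AI's jina-embeddings-v3, which accepts roughly 8k tokens\) and keep the token-level outputs rather than a single pooled vector. Pool those token embeddings into sentence-level embeddings, then chunk \*after\* embedding by grouping consecutive sentences whose embeddings are highly similar, so that chunks respect semantic boundaries. This needs a model that exposes per-token embeddings; a hosted API that returns only one vector per input, such as OpenAI's text-embedding-3-large, cannot supply them directly.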
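A minimal sketch of the pooling and grouping steps, assuming the token embeddings have already been produced by one forward pass over the whole document. The toy 2-d vectors, the sentence spans, the helper names `mean_pool` and `group_consecutive`, and the 0.8 threshold are all illustrative assumptions, not part of any library.

```python
import math

def mean_pool(token_vecs, spans):
    """Average the token vectors inside each (start, end) span, end exclusive."""
    pooled = []
    for start, end in spans:
        window = token_vecs[start:end]
        dim = len(window[0])
        pooled.append([sum(v[i] for v in window) / len(window) for i in range(dim)])
    return pooled

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def group_consecutive(sentence_vecs, threshold=0.8):
    """Group sentence indices in order; start a new chunk when similarity
    to the previous sentence falls below the (assumed) threshold."""
    if not sentence_vecs:
        return []
    chunks = [[0]]
    for i in range(1, len(sentence_vecs)):
        if cosine(sentence_vecs[i - 1], sentence_vecs[i]) >= threshold:
            chunks[-1].append(i)
        else:
            chunks.append([i])
    return chunks

# Toy token embeddings standing in for the output of a long-context model.
token_vecs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Hypothetical token ranges of three sentences in the document.
sentence_spans = [(0, 2), (2, 4), (4, 6)]

sentence_vecs = mean_pool(token_vecs, sentence_spans)
chunks = group_consecutive(sentence_vecs, threshold=0.8)
print(chunks)  # lists of sentence indices, one list per chunk
```

Because only adjacent sentences are compared, the grouping is linear in the number of sentences and preserves document order. The threshold would need tuning against real embeddings.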
Journey Context:
The old way is 'split every 1000 characters with overlap', which cuts through logical sections. 'Semantic chunking' often means clustering sentences, which needs O\(N^2\) pairwise comparisons and loses sentence order. 'Late Chunking' is a 2024/2025 pattern from Jina AI: because the model attends across the whole long context, the embeddings of smaller units \(sentences\) \*within\* that context carry document-level information, and chunk boundaries are then placed along the similarity trajectory, that is, where the similarity between consecutive sentences drops. This is crucial for agents because they need to retrieve \*coherent units\* \(like a full function definition\) in order to act; retrieving half a JSON object leads to parse errors. This pattern is replacing 'parent document retrieval' \(which searches over small chunks but stores and returns the larger parent document\) because it is more precise. We evaluated Parent Doc \(high storage, imprecise boundaries\) against Late Chunking \(better boundaries, requires a long-context embedder\); the latter wins for agent tool use.
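To make the failure mode of the old approach concrete, here is a small sketch of fixed-size character splitting with overlap. The helper `fixed_size_chunks`, the sample source text, and the window of 40 characters with 10 of overlap are assumptions chosen to keep the example short; the same effect appears with a 1000-character window on longer files.

```python
def fixed_size_chunks(text, size, overlap):
    """Split text into windows of `size` characters; each window shares
    `overlap` characters with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Hypothetical source file containing two small function definitions.
source = (
    "def area(w, h):\n"
    "    return w * h\n"
    "\n"
    "def perimeter(w, h):\n"
    "    return 2 * (w + h)\n"
)

chunks = fixed_size_chunks(source, size=40, overlap=10)
for chunk in chunks:
    print(repr(chunk))

# Check whether any single chunk holds the complete second function.
perimeter_def = "def perimeter(w, h):\n    return 2 * (w + h)\n"
print(any(perimeter_def in chunk for chunk in chunks))
```

The last line prints `False`: the second definition is longer than the window, so every chunk that touches it holds only a fragment, which is the situation that forces an agent into extra retrieval calls.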
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:17:11.475479+00:00— report_created — created