Report #100682
[architecture] How should I chunk documents for RAG so I keep context without bloating the index?
Start with RecursiveCharacterTextSplitter using paragraph → sentence → word separators and a token-based length function; reserve semantic or LLM-based chunkers for documents where topic boundaries are subtle and the extra cost is justified. Keep overlap at 10–20% of the chunk size and add structural metadata (parent_id, header) instead of cranking overlap higher to fix boundary loss.
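The recommendation above can be sketched in plain Python. This is not LangChain's RecursiveCharacterTextSplitter, only a minimal illustration of the same idea under stated assumptions: whitespace-separated words stand in for tokens, and the names recursive_split, build_chunks, and the position field are hypothetical. Overlap is taken out of the chunk budget so that a chunk plus its overlap stays within max_tokens.

```python
import re
from typing import Dict, List, Tuple

# (split pattern, join string), coarse to fine: paragraph, sentence, word.
LEVELS: List[Tuple[str, str]] = [
    (r"\n\s*\n", "\n\n"),
    (r"(?<=[.!?])\s+", " "),
    (r"\s+", " "),
]


def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate tokens; swap in a real tokenizer.
    return len(text.split())


def recursive_split(text: str, budget: int, levels=LEVELS) -> List[str]:
    """Split at the coarsest boundary that fits the budget, recursing only when needed."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= budget:
        return [text]
    if not levels:
        # Defensive fallback: hard cut by word count if no separator level remains.
        words = text.split()
        return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]
    (pattern, joiner), finer = levels[0], levels[1:]
    chunks: List[str] = []
    current = ""
    for piece in (p for p in re.split(pattern, text) if p.strip()):
        candidate = piece if not current else current + joiner + piece
        if count_tokens(candidate) <= budget:
            current = candidate  # keep merging small pieces at this level
            continue
        if current:
            chunks.append(current)
            current = ""
        if count_tokens(piece) <= budget:
            current = piece
        else:
            # This piece alone is too large: retry it with finer separators.
            chunks.extend(recursive_split(piece, budget, finer))
    if current:
        chunks.append(current)
    return chunks


def build_chunks(doc_id: str, header: str, text: str,
                 max_tokens: int = 200, overlap_ratio: float = 0.15) -> List[Dict]:
    """Attach parent_id/header metadata and a small overlap taken from the previous chunk."""
    if max_tokens < 1 or not 0 <= overlap_ratio < 1:
        raise ValueError("need max_tokens >= 1 and 0 <= overlap_ratio < 1")
    overlap = min(int(max_tokens * overlap_ratio), max_tokens - 1)
    cores = recursive_split(text, max_tokens - overlap)
    records = []
    for i, core in enumerate(cores):
        prefix = ""
        if i > 0 and overlap > 0:
            prefix = " ".join(cores[i - 1].split()[-overlap:]) + " "
        records.append({"parent_id": doc_id, "header": header,
                        "position": i, "text": prefix + core})
    return records
```

Calling build_chunks("doc-1", "Intro", text, max_tokens=200, overlap_ratio=0.15) returns a list of dicts ready to embed; the parent_id and header fields let retrieval recover context without a larger overlap.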
Journey Context:
Fixed-size chunking is fast but cuts mid-sentence and shreds tables. Semantic chunking aligns chunks with topic boundaries but typically requires one embedding per sentence, which materially raises ingestion cost and latency. The common mistake is to increase chunk overlap to recover lost context; that creates near-duplicate chunks, noisier retrieval, and larger vector storage. Recursive splitting is the practical default because it respects natural boundaries first and only falls back to character cuts when necessary. For long structured documents, add hierarchy: store parent chunks, retrieve children first, then expand to their parents, so the LLM gets both precise hits and surrounding context.
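The child-first, parent-expansion pattern described above might look like the following sketch. Everything here is illustrative: the Child dataclass, retrieve_with_parents, and the word-overlap score function are hypothetical names, and the scoring is a toy stand-in for a real vector similarity search.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Child:
    child_id: str
    parent_id: str
    text: str


def score(query: str, text: str) -> int:
    # Toy relevance: number of shared lowercase words (stand-in for vector similarity).
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_with_parents(query: str, children: List[Child],
                          parents: Dict[str, str],
                          top_k: int = 2) -> Tuple[List[Child], List[str]]:
    """Rank small child chunks, then expand the hits to their deduplicated parents."""
    ranked = sorted(children, key=lambda c: score(query, c.text), reverse=True)
    hits = [c for c in ranked[:top_k] if score(query, c.text) > 0]
    seen = set()
    expanded: List[str] = []
    for child in hits:
        if child.parent_id not in seen:  # two hits from one parent add it once
            seen.add(child.parent_id)
            expanded.append(parents[child.parent_id])
    return hits, expanded


parents = {
    "p1": "Refund policy section with full surrounding detail",
    "p2": "Shipping policy section with full surrounding detail",
}
children = [
    Child("c1", "p1", "refunds are issued within 14 days"),
    Child("c2", "p1", "refunds require a receipt"),
    Child("c3", "p2", "shipping takes 3 days"),
]
hits, context = retrieve_with_parents("refunds receipt", children, parents)
```

Because both matching children share a parent, the parent text is passed to the LLM once, which gives surrounding context without storing overlapping near-duplicates in the index.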
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:55:20.501008+00:00— report_created — created