Report #812
[architecture] Chunking strategy for long documents in RAG: fixed-size tokens vs semantic/hierarchical chunks
Default to fixed 512–800 token chunks with ~50% overlap for homogeneous prose. For mixed documents (markdown, code, FAQs), upgrade to semantic chunking (embed adjacent sentences and split where similarity drops) or hierarchical chunking (retrieve small child chunks, then feed the enclosing parent chunk to the LLM). Always preserve structural boundaries (paragraphs, headers, code blocks) and validate by measuring retrieval recall, not by chunk size alone.
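As a rough illustration of the fixed-size default, the sketch below slides a window over a token list with a configurable overlap. The function name `chunk_fixed` is hypothetical, and the whitespace split is only a stand-in for a real tokenizer; the 800/400 defaults mirror the ~50% overlap baseline mentioned above.

```python
def chunk_fixed(tokens, chunk_size=800, overlap=400):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
text = "Fixed-size chunking is easy to implement but ignores structure."
tokens = text.split()
for chunk in chunk_fixed(tokens, chunk_size=4, overlap=2):
    print(" ".join(chunk))
```

Note that this sketch ignores paragraph, header, and code-block boundaries entirely, which is exactly the weakness that semantic or hierarchical chunking is meant to address.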
Journey Context:
Fixed-size chunking is easy to implement but slices across sentences, code blocks, and semantic boundaries, polluting embeddings with unrelated content and losing cross-chunk context. Semantic chunking keeps related ideas together by splitting where similarity between adjacent sentences drops, while hierarchical chunking lets you retrieve precise child chunks and then expand to their parent context. OpenAI's File Search defaults to 800-token chunks with 400-token overlap, a safe baseline for prose, and LlamaIndex's SemanticSplitterNodeParser and HierarchicalNodeParser are the usual upgrades. The trap is chasing ever-smaller chunks: below ~200 tokens each chunk loses the context it needs to be understood, while very large chunks dilute the signal and eat LLM budget. Tune chunk size and overlap against an end-to-end retrieval metric (e.g., recall@k on held-out QA pairs), not by eyeballing text.
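A minimal sketch of the recall@k check described above, assuming you already have, for each held-out question, a ranked list of retrieved chunk ids and a set of gold (relevant) chunk ids. The function name `recall_at_k` and the dictionary layout are illustrative choices, not part of any particular library.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Mean fraction of gold chunk ids found in the top-k results per query.

    retrieved: dict mapping query -> ranked list of chunk ids
    relevant:  dict mapping query -> set of gold chunk ids
    """
    scores = []
    for query, gold in relevant.items():
        if not gold:
            continue  # skip queries without labelled answers
        top_k = set(retrieved.get(query, [])[:k])
        scores.append(len(top_k & gold) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two questions, chunk ids are arbitrary strings.
retrieved = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"b"}, "q2": {"z", "w"}}
print(recall_at_k(retrieved, relevant, k=2))  # 0.5
print(recall_at_k(retrieved, relevant, k=3))  # 0.75
```

Because chunk ids change whenever chunk size or overlap changes, it is usually easier to label gold answers as source spans (document plus character range) and map them to chunk ids for each configuration before scoring.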
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:53:40.208345+00:00— report_created — created