Report #1107
[architecture] What chunking strategy should I use as the default in a RAG pipeline?
Start with a structure-aware recursive splitter (e.g., LangChain's RecursiveCharacterTextSplitter or a Markdown-header-aware parser), targeting 400–512 tokens with 10–20% overlap. Only move to embedding-based semantic chunking if retrieval metrics justify the extra latency and cost.
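To make the size and overlap numbers concrete, here is a minimal sketch of the arithmetic only, not of the recursive splitter itself. The function name `overlap_windows` is hypothetical, and it takes an already-tokenized list, so whichever tokenizer produces that list is an assumption on your side.

```python
def overlap_windows(tokens, chunk_size=512, overlap_ratio=0.15):
    """Slide a fixed window over `tokens`, repeating a share of each chunk.

    Illustrative helper (not a library API): with chunk_size=512 and
    overlap_ratio=0.15, consecutive chunks share int(512 * 0.15) = 76 tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    stride = chunk_size - overlap  # how far the window advances each step

    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Usage: whitespace tokens stand in for a real tokenizer here (assumption).
words = ("lorem ipsum " * 600).split()
windows = overlap_windows(words, chunk_size=512, overlap_ratio=0.15)
print(len(windows), [len(w) for w in windows])
```

In a real pipeline the window edges would come from the splitter's structural boundaries rather than raw token offsets; this only shows how much text the 10–20% overlap actually duplicates.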
Journey Context:
Fixed-size chunks are easy to implement but routinely cut through sentences and paragraphs, so each chunk carries less usable context. Semantic chunking respects topical boundaries but is slower, costs an embedding call per boundary decision, and produces highly variable chunk sizes that hurt batching. Recursive splitting tries paragraph breaks first, then sentences, then words, so it keeps natural units intact while still fitting the model's token budget. For code, add separators like '\nclass ' and '\ndef '; for PDFs, page-level or layout-aware parsers often beat generic chunkers. Measure recall on your own documents before choosing exotic strategies.
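The paragraph-then-sentence-then-word fallback can be sketched in plain Python as below. This is an illustration of the idea, not LangChain's implementation: `recursive_split`, `PROSE_BOUNDARIES`, and `CODE_BOUNDARIES` are hypothetical names, length is measured in characters rather than tokens, and the separators are written as regex lookarounds (an assumption of this sketch) so that text such as 'class ' stays attached to the chunk it introduces.

```python
import re

# Zero-width boundaries, coarsest first: paragraph, line, sentence, word.
PROSE_BOUNDARIES = (r"(?<=\n\n)", r"(?<=\n)", r"(?<=[.!?] )", r"(?<= )")
# For Python source, try top-level class/def boundaries before the prose ones.
CODE_BOUNDARIES = (r"(?=\nclass )", r"(?=\ndef )") + PROSE_BOUNDARIES


def _split(text, max_len, boundaries):
    if len(text) <= max_len:
        return [text]
    if not boundaries:
        # Nothing finer to try: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    chunks, current = [], ""
    for piece in re.split(boundaries[0], text):
        if len(current) + len(piece) <= max_len:
            current += piece  # greedily pack pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # Piece is still too large: retry it with the finer boundaries.
            chunks.extend(_split(piece, max_len, boundaries[1:]))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, max_len=200, boundaries=PROSE_BOUNDARIES):
    """Return stripped, non-empty chunks of at most max_len characters."""
    return [c.strip() for c in _split(text, max_len, boundaries) if c.strip()]


# Usage on a small prose sample.
sample = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nThree."
for chunk in recursive_split(sample, max_len=40):
    print(repr(chunk))
```

Passing `boundaries=CODE_BOUNDARIES` applies the same logic to Python source. The sketch leaves out overlap and token counting, both of which a library splitter would normally handle for you.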
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T17:55:11.096853+00:00— report_created — created