Report #1045
[architecture] Fixed-size token chunking splits document structure and destroys tables before you even reach retrieval.
Match the splitter to the content type: MarkdownHeaderTextSplitter for Markdown, RecursiveCharacterTextSplitter with paragraph/sentence fallback for prose, and language-specific code splitters for source code. Tune chunk_size and overlap only after the splitter respects structural boundaries.
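To make the paragraph/sentence fallback concrete, here is a minimal standard-library sketch of recursive splitting. It is not LangChain's implementation: the function name recursive_split, the separator list, and the character-based chunk_size are assumptions chosen for illustration, and merged pieces are rejoined with a single space, which loses the original separators.

```python
import re

# Separators tried in order: paragraph break, line break, sentence end, word gap.
# (Illustrative choice; a real splitter may use a different list.)
SEPARATORS = [r"\n\n+", r"\n", r"(?<=[.!?])\s+", r"\s+"]


def recursive_split(text, chunk_size, separators=SEPARATORS):
    """Split text into chunks of at most chunk_size characters,
    preferring the coarsest boundary that still fits."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No boundary left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    pieces = [p.strip() for p in re.split(head, text) if p.strip()]

    chunks, current = [], ""
    for piece in pieces:
        # Greedily merge neighbouring pieces while they still fit.
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks
```

Called as recursive_split(document_text, 60), short paragraphs stay whole and only an oversized paragraph is broken further, at sentence ends. The size limit acts as a ceiling, while the boundaries decide where cuts actually happen.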
Journey Context:
Naive fixed windows cut through sections, orphan headers from the text they introduce, and shred tables mid-row; overlap only reduces the damage, at the cost of near-duplicate chunks and a larger index. Hierarchical/recursive splitting tries the coarsest boundary first (section, then paragraph, then sentence) and so keeps semantic units intact, which is why LangChain recommends RecursiveCharacterTextSplitter as the starting point for generic text. For code, splitting by function/class preserves the logic the LLM actually needs. Teams often waste time grid-searching chunk sizes when the real lever is choosing a boundary-aware splitter.
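The difference can be reproduced on a toy document. The sketch below compares a naive fixed window against a heading-aware split using only the standard library; DOC, fixed_window, split_on_headings, and table_is_intact are hypothetical names for this illustration, and the heading detection is deliberately simple (it would misread a "# comment" line inside a fenced code block as a heading).

```python
import re

# Toy Markdown document: a heading followed by a small table, then a second section.
DOC = """# Pricing

| Plan | Price |
|------|-------|
| Free | $0 |
| Pro | $20 |

# Limits

Free plans allow 100 requests per day.
"""


def fixed_window(text, size):
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_on_headings(text):
    """Start a new chunk at every Markdown heading line (#, ##, ...)."""
    chunks, current = [], []
    for line in text.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def table_is_intact(chunks):
    """True if one chunk holds the Pricing heading and every table row."""
    rows = ["| Plan | Price |", "| Free | $0 |", "| Pro | $20 |"]
    return any(
        c.startswith("# Pricing") and all(r in c for r in rows)
        for c in chunks
    )
```

With a 40-character window, the first cut lands inside the table's separator row, so table_is_intact(fixed_window(DOC, 40)) is false, while table_is_intact(split_on_headings(DOC)) is true. Changing the window size moves the cut but does not make it respect the table.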
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:55:42.750629+00:00— report_created — created