Report #99283
[architecture] What chunking strategy should I use for different content types in RAG?
Use semantic chunking for prose, boundary-aware chunking for code or structured docs, and fixed-size chunking with overlap only as a fallback. For code, split on AST boundaries and attach the relevant signatures and imports to each chunk as parent context; for prose, prefer sentence-aware or paragraph-aware chunking so each chunk is self-contained.
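As a rough illustration of AST-boundary splitting, here is a minimal sketch for Python source using only the standard-library `ast` module. The function name `chunk_python_source` and the chunk fields (`name`, `signature`, `text`, `context`) are hypothetical, chosen for this example; other languages would need their own parser, and nested definitions are left inside their parent chunk.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class.

    Each chunk carries the module's top-level imports as parent context,
    so a retrieved chunk still shows where its names come from.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    imports = []
    chunks = []

    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Keep the original import text, including multi-line imports.
            imports.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the def/class line, so start from the first one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append({
                "name": node.name,
                "signature": lines[node.lineno - 1].strip(),
                "text": "\n".join(lines[start - 1:node.end_lineno]),
            })

    # Attach imports after the full pass so late imports are not missed.
    header = "\n".join(imports)
    for chunk in chunks:
        chunk["context"] = header
    return chunks


if __name__ == "__main__":
    example = (
        "import os\n"
        "from math import sqrt\n"
        "\n"
        "def area(r):\n"
        "    return 3.14 * r * r\n"
        "\n"
        "class Shape:\n"
        "    def name(self):\n"
        "        return 'shape'\n"
    )
    for chunk in chunk_python_source(example):
        print(chunk["signature"])
```

Because every chunk ends on a node boundary, each `text` field parses on its own, which is not guaranteed with a fixed token window. Module-level statements other than imports, functions, and classes are skipped in this sketch; a real pipeline would likely collect them into a separate chunk.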
Journey Context:
Fixed-size token chunking is the default in tutorials, but it slices mid-sentence and breaks code syntax. The real tradeoff is retrieval precision versus context coherence: small chunks match a query more tightly but may lack the context needed to answer it, while large chunks keep context but dilute the match. Semantic chunking improves answer quality because each chunk carries a complete thought. For code, AST-aware splitting prevents broken syntax and keeps semantic units together. Overlap helps but is a band-aid; better boundaries reduce the need for it. Don't optimize chunk size alone; optimize what a chunk means.
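To make the contrast concrete, the sketch below puts a sentence-aware chunker next to a fixed-size one with overlap. The names `split_sentences`, `chunk_prose`, and `chunk_fixed` are hypothetical, sizes are measured in characters rather than tokens for simplicity, and the regex sentence splitter is a naive assumption that will misfire on abbreviations such as "e.g."; a proper sentence tokenizer would replace it in practice.

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def chunk_prose(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks, never crossing a paragraph break.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        current = ""
        for sentence in split_sentences(paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                # Budget exceeded: close the chunk at the sentence boundary.
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def chunk_fixed(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Fallback: fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


if __name__ == "__main__":
    sample = (
        "Chunking matters. Fixed windows cut sentences in half! "
        "Does overlap fix it?\n\n"
        "A new paragraph starts here. It stays separate."
    )
    print(chunk_prose(sample, max_chars=60))
    print(chunk_fixed(sample, size=60, overlap=15))
```

Running both on the same text shows the difference described above: the sentence-aware chunks each end on a complete sentence, while the fixed windows start and stop wherever the character count lands and rely on overlap to recover the cut context.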
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:52:55.050341+00:00 - report_created - created