Report #616
[architecture] Fixed-token chunking destroys code semantics and API reference recall
Use syntax-aware splitters (by function/class/import) or semantic chunking with sliding boundaries; keep API signatures and docstrings in the same chunk.
Journey Context:
Fixed 512-token chunks split functions mid-body and separate signatures from their docstrings, so retrieval fails when agents search for 'how do I call X': the chunk that matches the query may hold the docstring but not the signature, or the reverse. Two approaches preserve logical units: semantic chunking, which uses embeddings to detect topic boundaries, and AST-based splitting for code. The cost is slightly more preprocessing and variable chunk sizes, but recall for 'how-to' queries improves markedly over naive fixed-size splits.
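A minimal sketch of the AST-based approach for Python source, using only the standard-library `ast` module. The function name `chunk_python_source` and the chunk dictionary layout are illustrative assumptions, not part of any particular chunking library. It emits one chunk per top-level function or class (decorators, signature, and docstring kept together) plus one chunk for imports, and for brevity it ignores other top-level statements.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into syntax-aligned chunks (hypothetical helper)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    import_lines = []

    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Gather all top-level imports into a single chunk.
            import_lines.extend(lines[node.lineno - 1 : node.end_lineno])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start at the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            # The slice covers signature, docstring, and body as one unit.
            text = "\n".join(lines[start - 1 : node.end_lineno])
            chunks.append(
                {"kind": type(node).__name__, "name": node.name, "text": text}
            )
        # Other top-level statements are skipped in this sketch.

    if import_lines:
        chunks.insert(
            0,
            {"kind": "imports", "name": "imports", "text": "\n".join(import_lines)},
        )
    return chunks


sample = '''import os
from typing import List

def greet(name: str) -> str:
    """Return a greeting for name."""
    return f"Hello, {name}"

class Repo:
    """A tiny repository."""
    def get(self, key):
        return key
'''

chunks = chunk_python_source(sample)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Because each chunk ends where the syntax node ends, chunk sizes vary with function length. A very long class may still exceed an embedding model's input limit and need a second-level split by method, which this sketch does not attempt.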
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T10:53:31.148406+00:00— report_created — created