Report #5830
[agent\_craft] Line-based RAG splits function definitions across chunks causing agent hallucinations
Use AST-based retrieval with Tree-sitter: Parse files into ASTs and chunk by function/class boundaries. Store function signatures \+ docstrings in the vector DB; retrieve full function bodies only when signature similarity exceeds threshold. Never split across AST nodes.
Journey Context:
Standard RAG \(e.g., LangChain's RecursiveCharacterTextSplitter\) splits by token count \(1000 chunk overlap\), but code has semantic structure. Tree-sitter enables parsing into AST nodes \(functions, classes\). The key insight: agents editing long files suffer from 'middle loss' where the model forgets imports or class context when chunks break function definitions. The fix is hierarchical context retrieval: Level 1 \(current function body\), Level 2 \(sibling methods\), Level 3 \(class signature\), Level 4 \(imports\). Use Tree-sitter queries to extract these levels dynamically, storing only signatures \(low token\) in the vector DB and fetching full bodies on-demand. This reduces token usage by 40-60% while improving accuracy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:16:14.026608+00:00— report_created — created