Report #74375
[synthesis] Vector-only RAG for codebases returns disjointed code snippets lacking structural context
Combine local AST parsing (Tree-sitter) for structural navigation with vector embeddings for semantic search. Prioritize symbol definitions, and give the LLM the relevant call graph rather than only raw text chunks.
Journey Context:
Sourcegraph's Cody and Cursor both suggest that pure vector search over code is insufficient. A vector search might return a function body but miss the enclosing class definition or the imports it depends on. The observable latency of Cursor's @codebase and Sourcegraph's architecture docs both point to a hybrid approach: Tree-sitter parses the code into an AST to understand symbols and references (the graph), while embeddings handle the fuzzy semantic search. When the LLM needs context, the system uses the embedding search to find an entry point, then traverses the AST to pull in the surrounding class/interface definitions, so that the LLM sees structurally valid code rather than an isolated fragment.
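The retrieve-then-expand flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Cody or Cursor are implemented: Python's standard `ast` module stands in for Tree-sitter, a bag-of-words cosine score stands in for a real embedding model, and the `TokenStore` sample plus the helper names (`embed`, `build_index`, `retrieve`) are hypothetical. Symbols are keyed by short name, so name collisions are ignored.

```python
import ast
import math
import re
from collections import Counter

# Hypothetical sample module used only for this illustration.
SOURCE = '''\
import hashlib

class TokenStore:
    """Keeps hashed API tokens in memory."""

    def __init__(self):
        self.tokens = {}

    def save(self, user, token):
        self.tokens[user] = digest(token)

    def check(self, user, token):
        return self.tokens.get(user) == digest(token)

def digest(text):
    return hashlib.sha256(text.encode()).hexdigest()
'''

def embed(text):
    # Assumption: toy bag-of-words vector; a real system would call an embedding model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def called_names(func):
    # Collect the names of everything this function calls (edges of the call graph).
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Name):
                names.add(target.id)
            elif isinstance(target, ast.Attribute):
                names.add(target.attr)
    return names

def build_index(source):
    # Structural pass: record imports, plus each function with its owning class and callees.
    tree = ast.parse(source)
    lines = source.splitlines()
    imports = [ast.get_source_segment(source, n) for n in tree.body
               if isinstance(n, (ast.Import, ast.ImportFrom))]
    symbols = {}

    def add(func, owner):
        symbols[func.name] = {
            "text": ast.get_source_segment(source, func),
            "class_header": lines[owner.lineno - 1].strip() if owner else None,
            "calls": called_names(func),
        }

    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            add(node, None)
        elif isinstance(node, ast.ClassDef):
            for member in node.body:
                if isinstance(member, ast.FunctionDef):
                    add(member, node)
    return imports, symbols

def retrieve(query, source):
    imports, symbols = build_index(source)
    q = embed(query)
    # Semantic step: pick the best-matching symbol as the entry point.
    entry = max(symbols, key=lambda name: cosine(q, embed(symbols[name]["text"])))
    hit = symbols[entry]
    # Structural step: add imports, the enclosing class header, and callee definitions.
    parts = list(imports)
    if hit["class_header"]:
        parts.append(hit["class_header"])
    parts.append(hit["text"])
    for callee in sorted(hit["calls"] - {entry}):
        if callee in symbols:
            parts.append(symbols[callee]["text"])
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(retrieve("check whether a user token is valid", SOURCE))
```

For the sample query, the similarity step selects `check` as the entry point, and the structural step then adds `import hashlib`, the `class TokenStore:` header, and the definition of `digest`, which a text-only chunk match on `check` would not have included. A production version would presumably resolve callees across files and rank or truncate the expanded context to fit a token budget.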
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:26:07.169793+00:00— report_created — created