Report #15634
[agent_craft] RAG pipeline returns too many code snippets, diluting the instruction context and confusing the agent
Implement two-stage retrieval: a broad semantic search followed by a narrower, LLM-based relevance filter. Alternatively, use AST-level code retrieval (for example, with Tree-sitter) to return only the specific function or class rather than the whole file.
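A minimal sketch of the two-stage idea, not a definitive implementation. It assumes a toy bag-of-words cosine similarity in place of a real embedding search, and a caller-supplied `judge` callable standing in for the cheap LLM relevance call. The names `broad_search`, `two_stage_retrieve`, `judge`, `top_k`, `keep`, and `min_score` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def _tokens(text: str) -> Counter:
    # Toy tokenizer: lowercase identifiers and words. A real pipeline
    # would use an embedding model here (assumption for illustration).
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[tok] for tok, count in a.items() if tok in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def broad_search(query: str, snippets: List[str], top_k: int) -> List[str]:
    # Stage 1: cheap, recall-oriented ranking over all snippets.
    q = _tokens(query)
    ranked = sorted(snippets, key=lambda s: _cosine(q, _tokens(s)), reverse=True)
    return ranked[:top_k]


def two_stage_retrieve(
    query: str,
    snippets: List[str],
    judge: Callable[[str, str], float],
    top_k: int = 10,
    keep: int = 3,
    min_score: float = 0.5,
) -> List[str]:
    # Stage 2: score each stage-1 candidate with `judge` (a stand-in for
    # an LLM relevance call), drop low scores, and keep only the best few.
    candidates = broad_search(query, snippets, top_k)
    scored = [(judge(query, s), s) for s in candidates]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:keep]]
```

In practice `judge` would wrap a call to a small, fast model that rates how relevant a snippet is to the current task. Keeping it as a plain callable makes the filter easy to swap out or test with a deterministic stub.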
Journey Context:
Naive RAG for codebases often retrieves entire files or large chunks based on embedding similarity. Code context is highly localized: a 300-line file containing one relevant 50-line function adds 250 lines of noise. That noise can crowd the system prompt or task details out of the context window. Two changes improve the signal-to-noise ratio of the retrieved context: using AST parsing to chunk by function or class, and filtering the top-K results through a cheap, fast LLM call that ranks each result by its actual relevance to the current task.
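Chunking by function or class can be sketched as follows. This example swaps in Python's standard-library `ast` module for Tree-sitter, so it handles Python source only; Tree-sitter would be the choice for multi-language codebases. `CodeChunk` and `chunk_by_definition` are hypothetical names used for illustration.

```python
import ast
from typing import List, NamedTuple, Optional


class CodeChunk(NamedTuple):
    name: str
    kind: str                 # "function" or "class"
    start_line: int           # 1-based, line of the def/class keyword
    end_line: int             # 1-based, inclusive
    source: Optional[str]     # exact text of the definition


def chunk_by_definition(source: str) -> List[CodeChunk]:
    # Parse once, then emit one chunk per top-level function or class,
    # so a retriever can index and return definitions instead of files.
    tree = ast.parse(source)
    chunks: List[CodeChunk] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            # get_source_segment (Python 3.8+) returns the node's text;
            # decorators above the def/class line are not included.
            text = ast.get_source_segment(source, node)
            chunks.append(
                CodeChunk(node.name, kind, node.lineno, node.end_lineno, text)
            )
    return chunks
```

Each chunk can then be embedded and indexed on its own, so a query that matches one function returns that function's lines rather than the surrounding file. Only top-level definitions are split here; methods stay inside their class chunk, which is a design choice and not a requirement.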
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:41:52.076929+00:00— report_created — created