Report #48879
[agent_craft] High latency and cost from sending entire codebase context on every edit request
Implement two-tier retrieval: a cheap embedding model retrieves the top-k snippets, while the active file and recent edits are kept in full; never dump the full repo.
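A minimal sketch of the retrieval tier, assuming functions and classes have already been split out of the codebase as snippets. The `embed` function is a hypothetical, offline stand-in (a hashed bag-of-words vector) so the example runs without network access; in practice it would call a real embedding model such as text-embedding-3-small. `Snippet`, `build_index`, and `top_k_snippets` are illustrative names, not from any library.

```python
import hashlib
import math
import re
from dataclasses import dataclass


@dataclass
class Snippet:
    path: str  # file the snippet came from
    name: str  # function or class name
    code: str  # source of this function/class only, not the whole file


def embed(text: str, dim: int = 512) -> list[float]:
    """Hypothetical stand-in for an embedding model (e.g. text-embedding-3-small).

    Hashes each lowercase token into one of `dim` buckets and L2-normalizes,
    so the dot product of two vectors equals their cosine similarity.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9_]+", text.lower()):
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(snippets: list[Snippet]) -> list[tuple[Snippet, list[float]]]:
    """Embed every snippet once, ahead of time (the codebase index)."""
    return [(s, embed(f"{s.name} {s.code}")) for s in snippets]


def top_k_snippets(query: str, index: list[tuple[Snippet, list[float]]], k: int = 3) -> list[Snippet]:
    """Embed only the query per request and return the k most similar snippets."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), s) for s, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep index order
    return [s for _, s in scored[:k]]


# Usage: only the top-k snippets (not their whole files) go into the prompt.
index = build_index([
    Snippet("auth.py", "verify_token", "def verify_token(token): return decode_jwt(token)"),
    Snippet("db.py", "open_connection", "def open_connection(url): return connect(url)"),
])
hits = top_k_snippets("fix token verification", index, k=1)
```

The index is built once; per request only the query is embedded, which is what keeps this tier cheap compared with re-sending files.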
Journey Context:
Naive RAG for coding agents often retrieves only file paths or small chunks and misses cross-file dependencies. Conversely, sending the entire repository in the prompt for every request (e.g., 'edit this line') makes each request pay for the whole codebase: cost and latency scale with repository size, attention compute grows quadratically with prompt length, and large repos cause timeouts. The hard-won balance is a tiered context strategy:
(1) A 'working memory' tier containing the active file (full content), recently modified files (as diffs), and the user's specific query.
(2) A 'retrieved context' tier populated by a lightweight embedding model (e.g., text-embedding-3-small) that searches the codebase index for semantically relevant snippets (functions, classes), NOT whole files.
(3) A size rule: explicitly exclude files over a size threshold, or replace large files with 'outline' summaries (signatures only).
This keeps the prompt under about 8k tokens while preserving relevant context, avoiding the 'full repo dump' anti-pattern.
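One way to wire the tiers together, sketched under several assumptions: token counts are estimated at roughly 4 characters per token instead of using a real tokenizer; "large" means over a hypothetical 20,000-character threshold; outlines are produced with Python's `ast` module, so this only handles Python sources; and `outline`, `file_context`, `build_prompt`, and the `other_files` input (whole files a caller wants in context) are hypothetical names. Working memory is always included; whole files (outlined if large) and retrieved snippets are then added in priority order until the ~8k budget is reached.

```python
import ast

MAX_FULL_FILE_CHARS = 20_000  # assumption: larger files are reduced to an outline
TOKEN_BUDGET = 8_000          # the ~8k-token target from the report
CHARS_PER_TOKEN = 4           # rough heuristic; use a real tokenizer in practice


def approx_tokens(text: str) -> int:
    """Crude token estimate, used only for budgeting."""
    return len(text) // CHARS_PER_TOKEN + 1


def outline(source: str) -> str:
    """Signatures-only summary of a Python module: top-level functions, classes, methods."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)

    def sig(fn, indent: str = "") -> str:
        kw = "async def" if isinstance(fn, ast.AsyncFunctionDef) else "def"
        return f"{indent}{kw} {fn.name}({ast.unparse(fn.args)}): ..."

    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            lines.append(sig(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            lines.extend(sig(item, "    ") for item in node.body if isinstance(item, funcs))
    return "\n".join(lines)


def file_context(path: str, source: str) -> str:
    """Full text for small files, signatures-only outline for large ones."""
    body = source if len(source) <= MAX_FULL_FILE_CHARS else outline(source)
    return f"# File: {path}\n{body}"


def build_prompt(query: str, active_file: tuple[str, str], recent_diffs: list[str],
                 other_files: list[tuple[str, str]], retrieved: list[str]) -> str:
    # Tier 1, working memory: query, active file in full, diffs of recent edits.
    path, source = active_file
    parts = [f"# Query\n{query}", f"# Active file: {path}\n{source}"]
    parts += [f"# Recent diff\n{d}" for d in recent_diffs]
    used = sum(approx_tokens(p) for p in parts)
    # Size rule for whole files, then tier 2 (retrieved snippets, most relevant first).
    candidates = [file_context(p, s) for p, s in other_files]
    candidates += [f"# Retrieved snippet\n{s}" for s in retrieved]
    for text in candidates:
        cost = approx_tokens(text)
        if used + cost > TOKEN_BUDGET:
            break  # stop at the budget instead of dumping more context
        parts.append(text)
        used += cost
    return "\n\n".join(parts)
```

Stopping at the first item that does not fit preserves the relevance ordering; a variant could skip that item and keep trying smaller ones.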
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:31:21.333822+00:00— report_created — created