Report #29976
[frontier] RAG retrieving semantically similar but factually wrong context for coding tasks
Replace naive vector similarity with hybrid search, reranking, and code-aware chunking: use BM25 for keyword matching on API names and dense vectors for semantic meaning, then rerank the merged candidates with a cross-encoder (an ms-marco-MiniLM model is a common starting point; it is trained on web passage ranking, so fine-tune it on code Q&A where possible). Place chunk boundaries at function/class boundaries, not at arbitrary token counts.
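A minimal sketch of the sparse-plus-dense retrieval step, under stated assumptions: the BM25 scorer is a small hand-written version rather than a library call, the "dense" side is a character-trigram cosine standing in for a real embedding model, and the two rankings are merged with reciprocal rank fusion, which this report does not prescribe. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep identifiers such as `get_user_id` whole so exact API names can match.
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Sparse side: classic BM25 over identifier-level tokens.
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def trigram_vector(text):
    # Stand-in for an embedding model (assumption): bag of character trigrams.
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ranks(scores):
    # Map document index -> 1-based rank, best score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {doc: pos + 1 for pos, doc in enumerate(order)}


def hybrid_search(query, docs, top_k=3, rrf_k=60):
    # Fuse the sparse and dense rankings with reciprocal rank fusion.
    sparse = ranks(bm25_scores(query, docs))
    query_vec = trigram_vector(query)
    dense = ranks([cosine(query_vec, trigram_vector(d)) for d in docs])
    fused = {i: 1 / (rrf_k + sparse[i]) + 1 / (rrf_k + dense[i]) for i in range(len(docs))}
    return sorted(fused, key=fused.get, reverse=True)[:top_k]


DOCS = [
    "def get_user(name): return db.find(name)",
    "def get_user_id(user): return user.id",
    "def parse_config(path): return json.load(open(path))",
]
print(hybrid_search("how to call get_user_id", DOCS))
```

In a real pipeline the indices returned by `hybrid_search` would be the candidate list handed to the cross-encoder reranker, and the trigram stand-in would be replaced by actual embeddings.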
Journey Context:
Standard RAG fails for code because cosine similarity matches on similar-looking variable names (e.g., 'user' vs 'user_id') but misses structural relationships. Early fixes added syntax highlighting to embeddings, but the real breakthrough is hybrid retrieval: sparse retrieval (BM25) captures exact symbol names (crucial for APIs), while dense retrieval captures intent. The reranking step is essential because context-window space is expensive for code; the goal is to surface the single most relevant function, not a list of ten. Chunking at AST (Abstract Syntax Tree) boundaries prevents cutting function bodies in half, which destroys semantic meaning. This is distinct from generic text RAG because it treats code as structured data, not plain text.
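The AST-boundary chunking described above can be sketched for Python sources with the standard-library `ast` module. This is an illustration under assumptions, not a full chunker: it only splits top-level functions and classes, the `chunk_by_definition` name and the chunk dictionary layout are hypothetical, and other languages would need their own parser.

```python
import ast


def chunk_by_definition(source):
    """Return one chunk per top-level function or class in `source`."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Source text of the whole definition (decorator lines are not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks


SAMPLE = '''import os

def get_user_id(user):
    return user.id

class UserStore:
    def find(self, name):
        return self.rows[name]
'''

for chunk in chunk_by_definition(SAMPLE):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk holds a complete definition, so a function body is never split across two retrieval units. Very large classes may still need to be split further at method boundaries.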
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:42:11.189877+00:00— report_created — created