Report #50913
[synthesis] RAG agent retrieves and uses highly similar but functionally incorrect code snippets without throwing errors
Implement a post-retrieval cross-encoder re-ranker and track the score margin between the top-ranked chunk and the runner-up for the user's query; if the margin falls below a threshold, flag the retrieval for human review rather than auto-executing.
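A minimal sketch of the gating step, assuming the score delta is read as the margin between the best and second-best re-ranker scores for one query. The `Candidate` type, the `gate_on_margin` function, and the `min_margin` parameter are hypothetical names used only for illustration; the scores are assumed to come from whatever cross-encoder re-ranker is in use, and the threshold would need tuning per codebase.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    chunk_id: str  # hypothetical identifier for a retrieved chunk
    score: float   # re-ranker relevance score for (query, chunk); higher is better


def gate_on_margin(candidates: Sequence[Candidate], min_margin: float) -> tuple[str, Candidate]:
    """Return ("auto", top) when the top chunk clearly wins, else ("review", top)."""
    if not candidates:
        raise ValueError("no candidates to rank")
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    top = ranked[0]
    # With a single candidate there is no runner-up to compare against,
    # so this sketch treats the result as ambiguous and sends it to review.
    if len(ranked) < 2:
        return "review", top
    margin = top.score - ranked[1].score
    return ("auto" if margin >= min_margin else "review"), top
```

For example, candidates scored 0.91 and 0.40 with `min_margin=0.2` would be auto-executed, while 0.91 and 0.89 (say, an implementation file and its test file) would be routed to review.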
Journey Context:
Vector databases return chunks ranked by embedding similarity, typically cosine similarity. In large codebases, boilerplate or similarly structured but semantically different code (e.g., a test file vs. the implementation, or a v1 API vs. a v2 API) can score nearly as high against the query as the correct snippet. The agent retrieves the wrong snippet and writes code based on it; that code may even compile or pass basic linting, yet it is functionally incorrect for the specific context. Monitoring retrieval latency or absolute similarity scores misses this. The leading indicator is a narrowing gap between the top-k retrieval scores, which signals that the query does not clearly separate the candidates (ambiguous intent).
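The narrowing-gap signal can be sketched as a rolling metric. This is an illustrative assumption about how one might monitor it, not a description of any particular tool: `top_k_gap`, `GapMonitor`, `window`, and `floor` are hypothetical names, and the gap is defined here as the best score minus the k-th best score for a single query.

```python
from collections import deque
from statistics import mean


def top_k_gap(scores, k=5):
    """Best score minus the k-th best score; a small gap suggests ambiguity."""
    ranked = sorted(scores, reverse=True)[:k]
    if len(ranked) < 2:
        return 0.0  # nothing to compare against, so report no separation
    return ranked[0] - ranked[-1]


class GapMonitor:
    """Keeps a rolling window of per-query gaps and flags a low rolling mean."""

    def __init__(self, window=100, floor=0.05):
        self.gaps = deque(maxlen=window)  # oldest gaps fall out automatically
        self.floor = floor                # assumed alert threshold, tune per corpus

    def record(self, scores, k=5):
        """Record one query's scores; return True if the rolling mean gap is below the floor."""
        self.gaps.append(top_k_gap(scores, k))
        return mean(self.gaps) < self.floor
```

Tracking the mean over a window rather than a single query is a design choice in this sketch: it surfaces a systematic drift toward ambiguous retrievals, while the per-query margin check remains the place to block an individual execution.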
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:56:38.760435+00:00— report_created — created