Report #64689
[synthesis] Agent generates plausible but incorrect code due to subtle index pollution in its RAG tools
Periodically evaluate the agent's retriever against a golden dataset. Track the cosine similarity between each query and its top retrieved document; if that score drops below a threshold, force the agent to use a fallback search tool rather than reason over low-confidence context.
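A minimal sketch of the threshold gate, assuming embeddings are plain lists of floats and the index is small enough to scan in memory. The names `retrieve_with_fallback` and `fallback_search`, and the 0.75 threshold, are hypothetical placeholders and not part of any specific RAG library; a real threshold would need tuning against the golden dataset.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_with_fallback(
    query_vec: Sequence[float],
    index: Sequence[tuple[str, Sequence[float]]],
    fallback_search: Callable[[], str],
    threshold: float = 0.75,  # assumed value, tune per index
) -> tuple[str, str, float]:
    """Return (source, doc_id, score); source is "rag" or "fallback"."""
    best_id, best_score = "", -1.0
    for doc_id, vec in index:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_id, best_score = doc_id, score
    # Low-confidence (or empty) retrieval: do not reason over this context.
    if best_score < threshold:
        return ("fallback", fallback_search(), best_score)
    return ("rag", best_id, best_score)
```

Returning the score alongside the result lets the caller log it on every call, so a gradual decline is visible before the threshold is ever crossed.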
Journey Context:
As a codebase evolves, the vector embeddings used by the agent's RAG tool drift: new code is added, while old code is deprecated but never removed from the index. The agent queries the RAG tool, retrieves deprecated but highly similar code, and uses it with confidence. Nothing errors; the agent writes working code against old APIs, and monitoring sees only successful tool calls. The degradation takes the form of subtle API obsolescence, which teams notice only when the deprecated API is sunset. The fix is to monitor the quality of retrieval (similarity scores), not merely the fact that a retrieval occurred.
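The periodic golden dataset check could look roughly like the sketch below. It assumes the team maintains a list of query and expected-document pairs plus a set of document IDs known to be deprecated; the report specifies neither, so `GoldenCase`, `evaluate_retriever`, and the `deprecated_ids` argument are all hypothetical. Because deprecated code can score as highly similar, the sketch reports how often a deprecated document wins the top slot in addition to top-1 accuracy.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenCase:
    query: str
    expected_doc_id: str  # the document a healthy index should rank first


def evaluate_retriever(
    retrieve_top: Callable[[str], str],
    cases: Iterable[GoldenCase],
    deprecated_ids: set[str],
) -> dict[str, float]:
    """Run every golden query and summarize top-1 quality."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("golden dataset is empty")
    hits = 0
    deprecated_hits = 0
    for case in case_list:
        top_id = retrieve_top(case.query)
        if top_id == case.expected_doc_id:
            hits += 1
        if top_id in deprecated_ids:
            deprecated_hits += 1
    total = len(case_list)
    return {
        "top1_accuracy": hits / total,
        "deprecated_rate": deprecated_hits / total,
    }
```

Run on a schedule and compared against the previous run, a falling `top1_accuracy` or a rising `deprecated_rate` would flag index pollution even while every tool call still succeeds.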
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:03:54.647443+00:00— report_created — created