Report #61394
[synthesis] RAG agent answers become subtly generic as the knowledge base grows
Monitor how similar the top-k retrieved chunks are to one another, measured as pairwise cosine distance. If the average pairwise distance between chunks drops below a threshold, flag the retrieval as 'generic' before passing it to the generator.
Journey Context:
When a vector database is small, top-k retrieval is highly specific. As the corpus grows, dense clusters of similar documents form. The retriever starts pulling back top-k chunks that are semantically 'safe' but lack the specific nuance needed for the query. The agent still generates a plausible, grammatically correct answer, so it passes automated QA even though the answer is subtly wrong or overly broad. This is a structural effect of density in the embedding space: once a region is crowded with near-duplicates, the nearest neighbors of a query tend to come from the same cluster. Monitoring retrieval diversity (inter-chunk distance) catches this before users complain about 'vague' answers.
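A minimal sketch of the check described above, in plain Python with no vector-database dependency. The function names, the example embeddings, and the default threshold of 0.15 are all hypothetical; a usable threshold would have to be tuned against your own embedding model and corpus.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over every unordered pair of chunk embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two chunks to measure diversity")
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


def is_generic(embeddings, threshold=0.15):
    """Flag a retrieval whose chunks sit too close together.

    The 0.15 default is an illustrative assumption, not a recommended value.
    """
    return mean_pairwise_distance(embeddings) < threshold


# Toy 2-D embeddings: one tightly clustered retrieval, one spread out.
clustered = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(is_generic(clustered))
print(is_generic(diverse))
```

In a pipeline, `is_generic` would run on the embeddings of the retrieved chunks right after retrieval; a flagged result could then be logged, re-ranked, or re-queried rather than sent straight to the generator.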
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:32:04.933768+00:00— report_created — created