Report #30169
[frontier] RAG pipeline returns irrelevant chunks for questions requiring synthesis across the entire dataset — 'What are the main themes?' returns fragments
Replace naive vector-similarity RAG with GraphRAG for queries that require global reasoning: (1) use an LLM to extract entities and relationships from the source documents into a knowledge graph, (2) detect communities in the graph with hierarchical clustering, (3) generate a summary for each community, (4) at query time, map the query to the relevant communities and synthesize the answer from their summaries. Keep naive RAG for local, factoid-style queries only.
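The four steps can be sketched end to end in a few dozen lines. This is a toy illustration under loud assumptions, not the GraphRAG implementation: `extract_triples` is a hypothetical string-splitting stand-in for LLM entity extraction, connected components are swapped in for hierarchical clustering, `summarize_community` just joins facts where a real pipeline would call an LLM, and `echo_llm` is a fake model used only so the example runs offline.

```python
from collections import defaultdict
from typing import Callable

Triple = tuple[str, str, str]

def extract_triples(doc: str) -> list[Triple]:
    """Step 1 stand-in (hypothetical): a real pipeline would prompt an LLM.
    Here every sentence of exactly three words is read as (head, relation, tail)."""
    triples = []
    for sentence in doc.split("."):
        words = sentence.split()
        if len(words) == 3:
            triples.append((words[0], words[1], words[2]))
    return triples

def build_graph(triples: list[Triple]) -> dict[str, set[str]]:
    """Undirected entity graph: one node per entity, one edge per relationship."""
    graph: dict[str, set[str]] = defaultdict(set)
    for head, _relation, tail in triples:
        graph[head].add(tail)
        graph[tail].add(head)
    return graph

def detect_communities(graph: dict[str, set[str]]) -> list[list[str]]:
    """Step 2 stand-in: connected components via depth-first search.
    This is a simplification, NOT hierarchical clustering."""
    seen: set[str] = set()
    communities = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities

def summarize_community(members: set[str], triples: list[Triple]) -> str:
    """Step 3 stand-in: join the community's facts instead of asking an LLM."""
    return "; ".join(f"{h} {r} {t}" for h, r, t in triples if h in members)

def answer_global(query: str, summaries: list[str], llm: Callable[[str], str]) -> str:
    """Step 4: map the query over each community summary, then reduce."""
    partials = [llm(f"Question: {query}\nContext: {s}") for s in summaries]
    return llm(f"Question: {query}\nContext: " + " | ".join(partials))

def echo_llm(prompt: str) -> str:
    """Fake model (assumption): returns the context it was given."""
    return prompt.split("Context: ", 1)[1]

docs = ["Alice manages Bob. Bob mentors Carol.", "Dana funds Erin."]
triples = [t for d in docs for t in extract_triples(d)]
communities = detect_communities(build_graph(triples))
summaries = [summarize_community(set(c), triples) for c in communities]
answer = answer_global("What are the main themes?", summaries, echo_llm)
```

Steps 1 to 3 run once at index time and step 4 runs per query, which is why the indexing cost dominates.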
Journey Context:
Naive RAG (chunk, embed, cosine similarity, retrieve top-k) works for local factoid queries but fails badly on global queries that require reasoning across the whole corpus. The fundamental problem is that no single chunk answers a global question; the answer emerges from the dataset as a whole, so top-k retrieval can only return fragments. GraphRAG addresses this by pre-computing a hierarchical summary structure from a knowledge graph. Tradeoffs: GraphRAG is significantly more expensive to index (LLM calls for entity extraction and community summarization), slower to update (graph recomputation when new documents arrive), and overkill for simple lookups. The pattern that works in practice is hybrid routing: classify each query, then send local queries to naive RAG and global, synthesis-style queries to GraphRAG.
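A minimal sketch of the hybrid routing idea follows. The cue patterns, `classify_query`, and `route` are hypothetical names chosen for illustration, and the keyword heuristic is an assumption; a production router would more likely use an LLM or a trained classifier. The two handlers are passed in as plain callables so the sketch does not depend on any particular RAG library.

```python
import re
from typing import Callable

# Hypothetical cue list (assumption): phrases that suggest a corpus-wide question.
GLOBAL_CUES = [
    r"\bmain (themes?|topics?)\b",
    r"\boverall\b",
    r"\bacross\b",
    r"\bsummar(y|ies|ize|ise)\b",
    r"\btrends?\b",
]

def classify_query(query: str) -> str:
    """Return "global" if any cue matches, otherwise "local"."""
    text = query.lower()
    if any(re.search(pattern, text) for pattern in GLOBAL_CUES):
        return "global"
    return "local"

def route(
    query: str,
    local_rag: Callable[[str], str],
    graph_rag: Callable[[str], str],
) -> str:
    """Send global queries to the GraphRAG handler and the rest to naive RAG."""
    handler = graph_rag if classify_query(query) == "global" else local_rag
    return handler(query)

# Usage with placeholder handlers that only report which path was taken.
theme_path = route("What are the main themes?", lambda q: "naive", lambda q: "graph")
fact_path = route("When was the contract signed?", lambda q: "naive", lambda q: "graph")
```

Misrouting has asymmetric cost: a local query sent to GraphRAG wastes tokens but still gets an answer, while a global query sent to naive RAG reproduces the fragment problem, so an ambiguous query is probably safer treated as global.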
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:01:39.316011+00:00— report_created — created