Report #97325
[architecture] What's the right way to put SQL tables or spreadsheets into a vector store for RAG?
Don't naively embed whole tables or isolated cells. Route aggregation, join, and filtering questions through text-to-SQL or a structured query engine, and vectorize only small, semantically coherent row-level snippets for lookup-style questions. Store join paths, table descriptions, and column semantics as metadata or in a separate schema-retrieval index, not inside every row embedding.
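One way to picture the split between embedded row text and separately stored schema knowledge is the minimal sketch below. Everything in it is an illustrative assumption, not part of the original report: the `RowSnippet` class, the `row_to_snippet` helper, the `SCHEMA_DOCS` dictionary, and the `products` / `order_items` tables are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class RowSnippet:
    text: str  # the only part that would be sent to an embedding model
    metadata: dict = field(default_factory=dict)  # stored next to the vector, never embedded


def row_to_snippet(table: str, row: dict, text_columns: list, key: str) -> RowSnippet:
    """Serialize only the semantically meaningful columns of one row."""
    parts = [
        f"{col}: {row[col]}"
        for col in text_columns
        if row.get(col) not in (None, "")  # skip empty cells
    ]
    return RowSnippet(
        text="; ".join(parts),
        metadata={"table": table, "primary_key": row[key]},
    )


# Hypothetical schema-retrieval entries: descriptions, column semantics, and
# join paths live here once per table instead of being repeated in every row.
SCHEMA_DOCS = {
    "products": {
        "description": "One row per sellable product.",
        "columns": {"price": "List price in USD", "category": "Top-level catalog category"},
        "joins": ["products.id = order_items.product_id"],
    }
}

row = {
    "id": 7,
    "name": "Trail Runner 2",
    "description": "Lightweight trail shoe",
    "price": 129.0,
    "category": "footwear",
}
snippet = row_to_snippet("products", row, ["name", "description", "category"], key="id")
print(snippet.text)
print(snippet.metadata)
```

The numeric `price` column is deliberately left out of the embedded text here, on the assumption that price questions are better answered by the SQL path; the primary key in the metadata lets a vector hit be resolved back to the full row.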
Journey Context:
Vectorizing every row as 'Column: value, Column: value...' works for 'find a product like X' but fails for 'total revenue last quarter' because vector similarity cannot sum, join, or filter by date range. Vectorizing entire tables mixes too many facts into one blob and dilutes retrieval. The clean pattern is a dual-path architecture: a SQL/text-to-SQL engine for structured reasoning and a vector index for semantic row lookup, coordinated by a router that decides which path a question needs. Schema metadata and relationship descriptions power the routing and grounding steps without polluting the row embeddings.
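The dual-path idea described above can be sketched with the standard library alone. This is a toy under stated assumptions, not a reference implementation: the keyword-based `route` function stands in for whatever router is actually used (often an LLM classifier), the bag-of-words `embed` function stands in for a real embedding model, and the hard-coded query in `sql_path` stands in for a text-to-SQL step. The `orders` table and its rows are made-up sample data.

```python
import math
import re
import sqlite3
from collections import Counter

# Assumption: a crude keyword heuristic plays the role of the router.
AGGREGATION_HINTS = re.compile(
    r"\b(total|sum|average|avg|count|how many|per|between|last quarter|top)\b", re.I
)


def route(question: str) -> str:
    """Return 'sql' for aggregation/filter questions, 'vector' for lookups."""
    return "sql" if AGGREGATION_HINTS.search(question) else "vector"


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, revenue REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "waterproof hiking boots", 120.0, "2024-01-15"),
        (2, "trail running shoes", 80.0, "2024-02-10"),
        (3, "insulated water bottle", 25.0, "2024-05-02"),
    ],
)

# Vector path index: one small snippet per row, keyed by primary key.
ROW_INDEX = [(row_id, embed(product)) for row_id, product in conn.execute("SELECT id, product FROM orders")]


def sql_path(start: str, end: str) -> float:
    """Structured path: the database does the summing and date filtering."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(revenue), 0) FROM orders WHERE order_date >= ? AND order_date < ?",
        (start, end),
    ).fetchone()
    return total


def vector_path(question: str) -> int:
    """Semantic path: return the primary key of the most similar row snippet."""
    query = embed(question)
    best_id, _ = max(ROW_INDEX, key=lambda item: cosine(query, item[1]))
    return best_id


for q in ["What was total revenue last quarter?", "Find a product like hiking boots"]:
    print(q, "->", route(q))
```

In this sketch the revenue question never touches the vector index: similarity search could surface rows that mention revenue, but only the SQL engine can add them up over a date range.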
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:55:49.119235+00:00— report_created — created