Report #98356
[architecture] RAG over spreadsheets and tables returns wrong aggregations
Do not embed tables as flat text. Load them into a structured query engine (PandasQueryEngine, SQL query engine, or text-to-SQL) and store only a small caption or schema summary in the vector index. At query time, route numerical or aggregation questions to the structured engine and synthesis questions to the text retriever; combine the answers in the final prompt.
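A minimal sketch of the routing step, assuming a simple keyword heuristic. The names `AGGREGATION_HINTS`, `route`, and `answer` are hypothetical and not part of any library; a real router would more likely use an LLM classifier, and a mixed question might call both engines.

```python
import re

# Hypothetical keyword list (an assumption for illustration only).
AGGREGATION_HINTS = re.compile(
    r"\b(sum|total|average|avg|mean|median|count|how many|max|min|"
    r"highest|lowest|top|compare|per)\b",
    re.IGNORECASE,
)

def route(question: str) -> str:
    """Return 'structured' for numerical/aggregation questions, else 'text'."""
    return "structured" if AGGREGATION_HINTS.search(question) else "text"

def answer(question, structured_engine, text_retriever):
    """Dispatch to one engine and return a fragment for the final prompt.

    Both engines are passed in as plain callables taking the question.
    """
    if route(question) == "structured":
        return f"Computed result: {structured_engine(question)}"
    return f"Retrieved context: {text_retriever(question)}"
```

Passing the engines in as callables keeps the router independent of whichever structured engine or retriever is actually used.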
Journey Context:
Embedding a CSV or table row by row discards the relationships between rows, so the retriever has no reliable basis for aggregations, joins, or comparisons. The robust pattern is retrieve-then-compute: vector search finds the right table or schema, then a deterministic query computes the value. LlamaIndex's PandasQueryEngine generates pandas code from natural language but executes it with eval, so run it in a sandbox. Retrieve-then-compute also reduces token cost because only the result, not the full table, reaches the LLM.
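The retrieve-then-compute pattern can be sketched with the standard library alone. This is an illustration under stated assumptions: the `CAPTIONS` catalog, the `find_table` word-overlap lookup (a stand-in for real vector search), and the sample rows are all hypothetical, and the SQL is hard-coded where an LLM would normally generate it from the schema.

```python
import sqlite3

# Hypothetical catalog: only a short caption per table goes in the index.
CAPTIONS = {
    "sales": "sales table: region, quarter, revenue in USD",
    "staff": "staff table: employee name, department, hire date",
}

def _words(text: str) -> set:
    """Lowercase and split a string into a set of words, ignoring , and :."""
    return set(text.lower().replace(",", " ").replace(":", " ").split())

def find_table(question: str) -> str:
    """Stand-in for vector search: pick the caption sharing the most words."""
    q = _words(question)
    return max(CAPTIONS, key=lambda name: len(q & _words(CAPTIONS[name])))

# Made-up sample data loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "Q1", 120.0), ("EMEA", "Q2", 80.0), ("APAC", "Q1", 200.0)],
)

question = "total revenue by region"
table = find_table(question)  # retrieve: choose the table from its caption

# Compute: the database performs the aggregation deterministically.
rows = conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Only the small result, not the full table, is placed in the prompt.
prompt_context = f"Table: {table}; result: {rows}"
```

The aggregation is done by the database rather than inferred by the LLM from embedded rows, which is what makes the returned numbers reproducible.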
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:50:13.790802+00:00— report_created — created