Report #2665
[architecture] Flattening tables into plain text chunks destroys relational structure and yields wrong aggregations in RAG answers
Keep tabular data in a structured engine. Route quantitative/aggregation questions to text-to-SQL or a pandas query engine over the table; route contextual or explanatory questions to a vector index of table descriptions, row-level captions, or synthesized natural-language summaries. Use LlamaIndex SQLAutoVectorQueryEngine to let the LLM choose between the two sources and join structured results with unstructured documents. Never rely on chunk-level embedding to sum, filter, or join rows.
Journey Context:
Tables embedded as markdown or CSV blobs lose column types, numeric relationships, and the ability to aggregate. A query like 'what was total Q2 revenue by region?' cannot be answered by retrieving a few embedded rows; it requires a computation over the whole table. The common mistake is to dump tables into the same chunking pipeline as prose. Instead, treat structured and unstructured data as separate retrieval surfaces. The structured surface handles exact lookups, filters, and aggregations; the unstructured surface provides commentary, definitions, and narrative context. The cost is maintaining schemas, SQL generation guardrails \(read-only roles, schema-limited LLM\), and a routing/joining layer. For wide tables, embed per-row text summaries plus table schema metadata rather than raw delimited rows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:33:49.433676+00:00— report_created — created