Report #97878
[architecture] How should I handle tables and structured data in a RAG pipeline?
Treat tables as first-class objects, not plain text: extract each table, generate a compact text summary for embedding/retrieval, and store the full serialized table (or raw structured content) in a docstore. Retrieve by summary, then pass the original table to the LLM. For calculations or aggregations, prefer text-to-SQL or a DataFrame agent instead of embedding raw rows.
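A minimal sketch of the summary-plus-docstore idea, using only the Python standard library. Everything named here (`TableStore`, the sample tables, the hand-written summaries) is hypothetical, and a bag-of-words cosine score stands in for a real embedding model and vector index; in practice the summaries would likely be generated by an LLM.

```python
import math
import re
import uuid
from collections import Counter


def tokenize(text):
    """Lowercase word/number tokens; a crude stand-in for an embedding model."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TableStore:
    """Hypothetical store: summaries are searched, full tables are returned."""

    def __init__(self):
        self.summaries = {}  # table_id -> token counts of the summary
        self.docstore = {}   # table_id -> full serialized table

    def add(self, summary, full_table):
        table_id = str(uuid.uuid4())
        self.summaries[table_id] = Counter(tokenize(summary))
        self.docstore[table_id] = full_table
        return table_id

    def retrieve(self, question, k=1):
        query = Counter(tokenize(question))
        ranked = sorted(
            self.summaries,
            key=lambda tid: cosine(query, self.summaries[tid]),
            reverse=True,
        )
        # Return the original tables, not the summaries that matched.
        return [self.docstore[tid] for tid in ranked[:k]]


# Illustrative data only.
revenue_table = "| quarter | revenue_usd_m |\n|---|---|\n| Q1 | 120 |\n| Q2 | 135 |"
headcount_table = "| team | headcount |\n|---|---|\n| eng | 40 |\n| sales | 25 |"

store = TableStore()
store.add("Quarterly revenue in USD millions by quarter", revenue_table)
store.add("Headcount per team", headcount_table)

context = store.retrieve("What was revenue in Q2?")
```

Here `context` holds the complete revenue table, which is what would be placed in the LLM prompt; the summary is used only to find it.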
Journey Context:
Embedding raw CSV rows or flattened tables creates noisy vectors and loses row/column relationships, and splitting a table across chunks separates cells from their headers, so the answer can no longer be reconstructed from any single chunk. The multi-vector retriever pattern separates the retrievable summary from the generation payload. Benchmarks on financial reports show that table-summary retrieval outperforms long-context stuffing or naive chunking. If the question is numeric (sums, filters, joins), retrieval is the wrong tool: execute a query or generated Python/SQL against the structured source instead.
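For the numeric case, a rough sketch of executing a query against the structured source rather than retrieving text. The table, rows, and `run_readonly` helper are hypothetical, the SQL is hard-coded where a real pipeline would have an LLM generate it, and the SELECT-only check is a deliberately naive guard, not a security boundary.

```python
import sqlite3

# Illustrative rows; in practice these come from the extracted table.
rows = [
    ("Q1", "EMEA", 120.0),
    ("Q1", "APAC", 80.0),
    ("Q2", "EMEA", 135.0),
    ("Q2", "APAC", 95.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (quarter TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)

# Assumption: an LLM would produce this from the question
# "What is total EMEA revenue?" plus the table schema.
generated_sql = "SELECT SUM(amount) FROM revenue WHERE region = 'EMEA'"


def run_readonly(connection, sql):
    """Run a single SELECT statement; reject anything else (naive guard)."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return connection.execute(sql).fetchall()


total = run_readonly(conn, generated_sql)[0][0]
```

The database computes the aggregate exactly, so the LLM only has to phrase the answer around `total` instead of doing arithmetic over retrieved rows.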
Lifecycle
2026-06-26T04:51:11.945491+00:00 - report_created - created