Report #974
[architecture] Naive text splitting shreds tables and makes numeric RAG unreliable
Extract tables with a layout-aware parser, store the raw table in a docstore, and index a natural-language summary or row-level description in the vector store. Route aggregation, filtering, and comparison questions to a structured query (SQL/Pandas), not vector search.
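A minimal sketch of the summary-plus-docstore split described above, using only the standard library. The `TableStore` class, its method names, and the sample tables are hypothetical, and token overlap stands in for embedding similarity so the example runs without a vector database. A real setup would embed the summaries and use a layout-aware parser to produce the rows.

```python
import re
from dataclasses import dataclass, field


def _tokens(text):
    """Lowercased word tokens; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class TableStore:
    """Toy store: summaries are searched, raw tables are returned."""
    docstore: dict = field(default_factory=dict)   # table_id -> raw table rows
    summaries: dict = field(default_factory=dict)  # table_id -> searchable summary

    def add(self, table_id, rows, summary):
        self.docstore[table_id] = rows
        self.summaries[table_id] = summary

    def retrieve(self, query):
        """Match the query against summaries, then return the whole table."""
        q = _tokens(query)
        best_id = max(
            self.summaries,
            key=lambda tid: len(q & _tokens(self.summaries[tid])),
        )
        return best_id, self.docstore[best_id]


store = TableStore()
store.add(
    "revenue_2023",
    [{"region": "EMEA", "revenue": 120.5}, {"region": "APAC", "revenue": 98.0}],
    "Quarterly revenue by region for 2023, in millions of USD",
)
store.add(
    "headcount",
    [{"team": "Platform", "employees": 14}],
    "Employees per engineering team",
)

# The summary is what gets matched; the intact table is what the LLM receives.
table_id, rows = store.retrieve("What was revenue in the EMEA region?")
```

The point of the split is that the searchable text and the text handed to the LLM are different objects linked by an ID, so the table is never cut at a chunk boundary.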
Journey Context:
Tables embedded as raw text rows lose row/column relationships and numeric precision, and the retriever returns fragments the LLM cannot reassemble. The Multi-Vector Retriever pattern decouples the searchable summary from the artifact used for answer synthesis, so the LLM gets the full table when a table is retrieved. For questions that need aggregation, max/min, or filters, vector search is the wrong tool; use text-to-SQL or a dataframe agent on the parsed table. A bigger context window does not fix broken table chunks.
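The routing idea can be sketched as follows, assuming the table has already been parsed into rows. The keyword router, the `revenue` schema, and the sample figures are all illustrative assumptions. The SQL is hard-coded here where a text-to-SQL step would normally generate it, and a production router would more likely be an LLM or a trained classifier than a keyword list.

```python
import re
import sqlite3

# Hypothetical trigger words that suggest the question needs computation.
AGG_WORDS = {"sum", "total", "average", "mean", "max", "min",
             "highest", "lowest", "count", "compare"}


def needs_structured_query(question):
    """Return True when the question looks like aggregation or comparison."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return bool(words & AGG_WORDS)


def load_table(rows):
    """Load parsed table rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revenue (region TEXT, quarter TEXT, amount REAL)")
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    return conn


rows = [
    ("EMEA", "Q1", 120.5),
    ("EMEA", "Q2", 131.0),
    ("APAC", "Q1", 98.0),
    ("APAC", "Q2", 104.25),
]
conn = load_table(rows)

question = "Which region had the highest total revenue?"
if needs_structured_query(question):
    # Stand-in for generated SQL: the database does the arithmetic exactly.
    sql = (
        "SELECT region, SUM(amount) AS total FROM revenue "
        "GROUP BY region ORDER BY total DESC LIMIT 1"
    )
    answer = conn.execute(sql).fetchone()
else:
    answer = None  # fall back to vector retrieval for lookup-style questions
```

Matching on whole words rather than substrings avoids false positives such as "min" inside "administration".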
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:54:44.946589+00:00— report_created — created