Report #1049
[architecture] Flattening tables into plain text chunks destroys row-column relationships and makes numeric retrieval unreliable.
Preserve table structure by parsing tables into Markdown, HTML, or another layout-aware representation. Use a multi-vector (parent-document) pattern: embed small retrievable units such as summaries or hypothetical questions, but return the full table to the LLM. Route aggregation and numeric questions to text-to-SQL or pandas instead of vector search.
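A minimal sketch of the multi-vector (parent-document) idea, using only the Python standard library. Everything here is illustrative: `MultiVectorStore`, `add_table`, and `retrieve` are hypothetical names, the table values are made up, and a toy token-overlap score stands in for real embedding similarity. The point is only that small searchable texts are indexed while the intact table is what gets returned.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercased bag of words; a stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Toy similarity: number of shared tokens (NOT a real embedding metric)."""
    return sum((a & b).values())


class MultiVectorStore:
    """Hypothetical store: many small searchable texts point at one parent table."""

    def __init__(self):
        self.parents = {}    # parent_id -> full table markup, kept intact
        self.children = []   # (token bag, parent_id) for each searchable text

    def add_table(self, parent_id, table_markdown, searchable_texts):
        self.parents[parent_id] = table_markdown
        for text in searchable_texts:
            self.children.append((tokens(text), parent_id))

    def retrieve(self, query):
        """Match against the small units, but return the whole parent table."""
        q = tokens(query)
        best = max(self.children, key=lambda c: overlap(q, c[0]), default=None)
        if best is None or overlap(q, best[0]) == 0:
            return None
        return self.parents[best[1]]


# Made-up example tables, stored as Markdown so rows and headers stay together.
REVENUE_TABLE = """| Region | Q1 revenue | Q2 revenue |
|--------|-----------:|-----------:|
| EMEA   | 120        | 135        |
| APAC   | 90         | 110        |"""

HEADCOUNT_TABLE = """| Team     | Engineers |
|----------|----------:|
| Platform | 12        |
| Mobile   | 7         |"""

store = MultiVectorStore()
store.add_table(
    "revenue",
    REVENUE_TABLE,
    ["Quarterly revenue by region", "What was EMEA revenue in Q2?"],
)
store.add_table(
    "headcount",
    HEADCOUNT_TABLE,
    ["Headcount by team", "How many engineers are on the platform team?"],
)

result = store.retrieve("APAC revenue in Q1")
```

Here the query matches a summary and a hypothetical question for the revenue table, and `result` is the complete Markdown table, so the LLM still sees which value belongs to which row and column. In a real system the token-overlap scoring would be replaced by an embedding index.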
Journey Context:
Tables carry meaning through headers, alignment, and cell proximity; once a table is split into token windows, the LLM cannot reliably reconstruct which value belongs to which row or column. Layout-aware parsers output structured markup that keeps rows and their headers together. For databases and CSV files, semantic search is the wrong tool for aggregation: it returns only the top-matching chunks, so sums, counts, and averages are computed over partial data. A query router that sends analytical questions to SQL or pandas computes the answer over every row and is therefore far more accurate. The multi-vector retriever pattern is a common way to keep small searchable units while still passing complete objects into context.
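The routing idea can be sketched as follows, assuming a simple keyword heuristic and an in-memory SQLite table. The names `route`, `build_db`, and `run_sql`, the `sales` schema, and the data are all hypothetical; in practice the routing decision and the SQL text would more likely come from an LLM or a classifier, which is not shown here.

```python
import re
import sqlite3

# Hypothetical heuristic: words that suggest an aggregation over rows.
AGG_PATTERN = re.compile(
    r"\b(sum|total|average|avg|mean|count|how many|max|min|highest|lowest)\b",
    re.IGNORECASE,
)


def route(question):
    """Return 'sql' for analytical questions, 'vector' for everything else."""
    return "sql" if AGG_PATTERN.search(question) else "vector"


def build_db():
    """Create a small in-memory table with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("EMEA", "Q1", 120), ("EMEA", "Q2", 135),
         ("APAC", "Q1", 90), ("APAC", "Q2", 110)],
    )
    return conn


def run_sql(conn, sql, params=()):
    """Execute a query that a text-to-SQL step would have produced."""
    return conn.execute(sql, params).fetchall()


conn = build_db()

analytical = route("What is the total EMEA revenue?")
lookup = route("What does the refund policy say?")

# The SQL below is written by hand; a text-to-SQL model would generate it.
rows = run_sql(
    conn,
    "SELECT SUM(revenue) FROM sales WHERE region = ?",
    ("EMEA",),
)
emea_total = rows[0][0]
```

With this data, the first question is routed to SQL and the second to vector search, and the aggregate is computed by the database over every matching row rather than inferred from retrieved chunks. A keyword list like this will misroute some questions, so treat it as a placeholder for a more robust classifier.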
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:56:43.624016+00:00— report_created — created