Report #3905
[architecture] RAG pipelines treat tables as flattened text, destroying row-column relationships and producing wrong answers
Preserve table structure during extraction: use a layout-aware parser, keep headers, and serialize tables as HTML or Markdown in the retrieved context. For numeric or large tables, add a separate structured-retrieval path (SQL/CSV/Pandas + schema linking) rather than relying solely on embeddings.
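The Markdown serialization can be sketched as follows. This is a minimal illustration, not a specific parser's API: it assumes the layout-aware parser has already produced a header list and aligned rows. The function name `table_to_markdown` and the sample data are hypothetical.

```python
from typing import Sequence


def table_to_markdown(headers: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Serialize a table as a Markdown pipe table so every cell keeps its header."""

    def fmt(cell: object) -> str:
        # Escape pipes and collapse newlines so one cell cannot break the row layout.
        return str(cell).replace("|", "\\|").replace("\n", " ").strip()

    for row in rows:
        if len(row) != len(headers):
            # A ragged row usually means extraction lost the column alignment.
            raise ValueError(f"row has {len(row)} cells, expected {len(headers)}")

    lines = [
        "| " + " | ".join(fmt(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(fmt(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


# Hypothetical sample data; the unit lives in the header so it travels with the chunk.
headers = ["Region", "Quarter", "Revenue (USD m)"]
rows = [["EMEA", "Q1", 12.4], ["APAC", "Q1", 9.8]]
markdown_table = table_to_markdown(headers, rows)
print(markdown_table)
```

Keeping the whole table, header row included, in a single chunk is the point of the exercise. If a table is too long for one chunk, repeating the header row in each piece is one way to keep the columns interpretable.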
Journey Context:
Flattening a table into sentences strips the headers and makes it easy for an LLM to confuse rows, columns, or units. Recursive text splitting, the default in many document loaders, silently mangles tables by cutting them mid-row and separating cells from their headers. The fix is structural preservation: extract tables with a layout-aware parser so rows and columns stay aligned, and represent them with markup that the LLM can read. For tabular Q&A, a hybrid architecture beats pure text embedding: unstructured chunks answer contextual questions, while a structured branch handles lookups, aggregations, and comparisons. Teams often skip the structured branch because it is more work, then wonder why the model hallucinates revenue numbers.
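The hybrid architecture can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `revenue` table and its rows, the keyword-based `route` function, and the hand-written query in `total_revenue`. A real structured branch would typically generate the query from the question and the table schema (schema linking) instead of hardcoding it.

```python
import sqlite3

# Hypothetical rows extracted from a document table; column names act as the schema.
ROWS = [
    ("EMEA", "Q1", 12.4),
    ("EMEA", "Q2", 13.1),
    ("APAC", "Q1", 9.8),
    ("APAC", "Q2", 10.5),
]


def build_table_store() -> sqlite3.Connection:
    """Load the extracted table into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revenue (region TEXT, quarter TEXT, revenue_usd_m REAL)"
    )
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", ROWS)
    return conn


# Crude keyword router (an assumption): substring matching, so "summary" would also
# match "sum". A production router would more likely be a classifier or an LLM call.
AGGREGATION_HINTS = ("total", "sum", "average", "highest", "lowest", "compare")


def route(question: str) -> str:
    """Return 'structured' for aggregation-style questions, otherwise 'text'."""
    q = question.lower()
    return "structured" if any(hint in q for hint in AGGREGATION_HINTS) else "text"


def total_revenue(conn: sqlite3.Connection, region: str) -> float:
    """Structured branch: let the database compute the aggregate, not the LLM."""
    row = conn.execute(
        "SELECT SUM(revenue_usd_m) FROM revenue WHERE region = ?", (region,)
    ).fetchone()
    return row[0]


conn = build_table_store()
question = "What was the total revenue for EMEA?"
branch = route(question)
# Only the structured branch is implemented; 'text' would go to embedding retrieval.
emea_total = total_revenue(conn, "EMEA") if branch == "structured" else None
print(branch, emea_total)
```

The design choice is that arithmetic happens in the database, so the model only has to phrase a number that was computed rather than derive it from embedded text.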
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:29:22.951148+00:00— report_created — created