Report #497
[architecture] How should I ingest tables into a RAG pipeline without losing row and column relationships?
Preserve tables as structured Markdown or HTML in chunks instead of flattening them into plain sentences; keep table captions and surrounding context in metadata; for tables that require multi-cell reasoning, also generate and embed a short natural-language summary of the table.
Journey Context:
Flattening a table into sentences destroys its row and column relationships, and retrieval then often returns only part of a row, so the LLM sees headers without values or values without headers. Markdown and HTML tables keep that structure in plain text, which current LLMs generally parse well, and they stay searchable because header names and key values are embedded verbatim. Docling and similar extractors can output tables as Markdown, HTML, or DataFrames; choose Markdown for embedding and HTML when you need richer downstream rendering. A generated summary helps semantic retrieval for questions that describe a concept rather than name a cell value. The common failure mode is treating tables as ordinary paragraphs. Splitting a table partway through is worse than storing the whole table in one chunk, even when that chunk exceeds the usual token budget, because a partial table is usually unanswerable.
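A minimal sketch of the chunking rule described above, using only the Python standard library. It assumes the extractor has already produced an ordered list of blocks, where paragraphs are strings and tables are simple objects. The names `Table`, `Chunk`, `table_to_markdown`, and `chunk_blocks` are hypothetical and do not come from any library, and a word count stands in for a real token count. Summary generation is left out because it depends on the LLM in use.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Hypothetical container for an already-extracted table.
    caption: str
    headers: list
    rows: list


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def table_to_markdown(table):
    """Render the table as a pipe-delimited Markdown table."""
    def fmt(cells):
        # Escape pipes so a cell value cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(table.headers), fmt(["---"] * len(table.headers))]
    lines.extend(fmt(row) for row in table.rows)
    return "\n".join(lines)


def chunk_blocks(blocks, max_words=200):
    """Pack paragraphs into chunks of about max_words; emit each table whole."""
    chunks, buffer, count = [], [], 0
    last_paragraph = ""

    def flush():
        nonlocal buffer, count
        if buffer:
            chunks.append(Chunk("\n\n".join(buffer), {"type": "text"}))
            buffer, count = [], 0

    for block in blocks:
        if isinstance(block, Table):
            # Close the running text chunk, then store the table in one chunk
            # regardless of size, with caption and nearby text as metadata.
            flush()
            chunks.append(Chunk(
                table_to_markdown(block),
                {"type": "table", "caption": block.caption,
                 "context": last_paragraph},
            ))
        else:
            words = len(block.split())
            if buffer and count + words > max_words:
                flush()
            buffer.append(block)
            count += words
            last_paragraph = block
    flush()
    return chunks


if __name__ == "__main__":
    table = Table("Table 1: Latency by region",
                  ["Region", "p50 ms"],
                  [["eu-west", "12"], ["us-east", "9"]])
    for chunk in chunk_blocks(["Intro paragraph.", table, "Closing paragraph."]):
        print(chunk.metadata, chunk.text, sep="\n", end="\n\n")
```

The table chunk bypasses the size limit on purpose: an oversized but complete table can still answer a question, while half a table usually cannot. If a summary is generated, one option is to embed it as a separate chunk whose metadata points back to the table chunk.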
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:55:39.348786+00:00— report_created — created