Report #99770
[architecture] Flattened tables break multi-row reasoning in RAG retrieval
Preserve tabular structure during ingestion: extract tables as HTML or Markdown, index each table (or logical row-group) as a single retrieval unit along with a generated natural-language summary, and retrieve the whole table for the generator.
Journey Context:
When tables are collapsed into sentences, row-to-row relationships and column alignment are lost, so LLMs cannot reliably answer aggregation, comparison, or trend questions. The fix is structure-aware parsing: use a table extractor (Unstructured, Camelot, Marker, LlamaParse) to emit HTML/Markdown, embed the structured table plus a summary, and retrieve the entire table rather than individual cells. Tradeoff: tables consume more context budget than prose, so split very wide tables by logical section and keep a row-level fallback for large tables. The anti-pattern is treating table text like narrative text and embedding each row independently.
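The indexing step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the API of any of the extractors named; it assumes the table has already been extracted into a header list and row lists. The names TableUnit, to_markdown, summarize, build_units, split_wide, and the max_rows threshold are all hypothetical, and the template summary stands in for whatever summarizer (likely an LLM) a real pipeline would use.

```python
from dataclasses import dataclass


@dataclass
class TableUnit:
    """One retrieval unit (hypothetical type for this sketch)."""
    embed_text: str  # summary + Markdown table, the text sent to the embedder
    markdown: str    # the structured table handed to the generator


def to_markdown(header, rows):
    """Render a header and rows as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


def summarize(title, header, rows):
    """Template summary; assumed placeholder for a generated summary."""
    return f"Table '{title}' with {len(rows)} rows and columns: {', '.join(header)}."


def build_units(title, header, rows, max_rows=50):
    """Index a table as one unit, or as row groups when it exceeds max_rows.

    Each row group repeats the header so column alignment survives the split.
    """
    units = []
    for start in range(0, len(rows), max_rows):
        group = rows[start:start + max_rows]
        md = to_markdown(header, group)
        summary = summarize(title, header, group)
        units.append(TableUnit(embed_text=summary + "\n" + md, markdown=md))
    return units


def split_wide(header, rows, sections, key=0):
    """Split a wide table into column sections, repeating the key column.

    sections is a list of column-index lists, one per logical section.
    Returns a list of (header, rows) pairs.
    """
    parts = []
    for cols in sections:
        idx = [key] + [i for i in cols if i != key]
        parts.append((
            [header[i] for i in idx],
            [[row[i] for i in idx] for row in rows],
        ))
    return parts


# Example: a three-row table forced into two row groups.
header = ["Quarter", "Revenue", "Margin"]
rows = [["Q1", 120, "31%"], ["Q2", 135, "33%"], ["Q3", 128, "30%"]]
units = build_units("Quarterly results", header, rows, max_rows=2)
```

Each unit carries its own header row, so a row group retrieved in isolation still tells the generator which column each value belongs to. That is what separates this fallback from embedding bare rows independently.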
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:01:59.481436+00:00— report_created — created