Report #1145
[architecture] Tables are flattened into text and break RAG answers
Preserve table structure as markdown/HTML or row-level records; embed each row with its header context, and route analytical table questions through a structured query engine (Pandas/SQL) rather than a plain vector retriever.
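A minimal sketch of the "embed each row with its header context" idea, assuming the table has already been extracted into a header list and a list of rows. The function name `rows_to_records`, the record layout, and the sample table are hypothetical and only for illustration; the actual embedding call is left out because it depends on the model in use.

```python
from typing import Dict, List


def rows_to_records(title: str, header: List[str], rows: List[List[str]]) -> List[Dict[str, object]]:
    """Turn each table row into a self-describing text chunk plus metadata."""
    records = []
    for i, row in enumerate(rows):
        # A ragged row usually means extraction went wrong; fail loudly instead of misaligning cells.
        if len(row) != len(header):
            raise ValueError(f"row {i} has {len(row)} cells, expected {len(header)}")
        # Pair every value with its column name so the chunk still makes sense in isolation.
        pairs = "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        records.append({
            "text": f"Table: {title}. {pairs}",
            "metadata": {"table": title, "row_index": i, "cells": dict(zip(header, row))},
        })
    return records


# Hypothetical sample table, not taken from the report.
header = ["Region", "Quarter", "Revenue (USD)"]
rows = [["EMEA", "Q1", "1200"], ["APAC", "Q1", "950"]]
records = rows_to_records("Quarterly revenue", header, rows)
print(records[0]["text"])
# prints: Table: Quarterly revenue. Region: EMEA; Quarter: Q1; Revenue (USD): 1200
```

Each `text` value is what would be embedded; the `metadata` keeps the original cells so a retrieved row can be traced back to its table.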
Journey Context:
When PDFs or HTML pages are chunked naively, tables become a soup of values detached from their headers and row relationships. Vector similarity on flattened tables is poor because a cell's meaning depends on its column header and its row, and aggregation questions (sums, counts, averages) cannot be answered reliably from a handful of retrieved chunks because they need every relevant row. Keeping tables structured and using a table-aware retriever or an NL-to-SQL/NL-to-Pandas path gives correct, grounded answers. The cost is pipeline complexity: schema extraction and query generation can fail on messy real-world tables. The alternative, flattening, is simpler but almost guarantees hallucinated or incomplete answers to table questions.
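The routing idea described above can be sketched roughly as follows, using the standard-library `sqlite3` module as the structured query engine. The keyword heuristic `is_analytical`, the helper `load_table`, the table name `revenue`, and the sample rows are all assumptions made for illustration; a real pipeline would more likely use a classifier or an LLM for routing and would generate and validate the SQL rather than hard-code it.

```python
import re
import sqlite3
from typing import List, Sequence, Tuple

# Hypothetical keyword heuristic for spotting aggregation-style questions.
ANALYTICAL = re.compile(
    r"\b(sum|total|average|mean|count|how many|max|min|highest|lowest)\b", re.I
)


def is_analytical(question: str) -> bool:
    """Return True when the question looks like it needs computation over rows."""
    return bool(ANALYTICAL.search(question))


def load_table(conn: sqlite3.Connection, name: str, columns: Sequence[str], rows: List[Tuple]) -> None:
    """Load an extracted table into SQLite. Identifiers are trusted here; sanitize them in real use."""
    cols = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{name}" ({cols})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', rows)


conn = sqlite3.connect(":memory:")
load_table(
    conn,
    "revenue",
    ["region", "quarter", "usd"],
    [("EMEA", "Q1", 1200), ("APAC", "Q1", 950), ("EMEA", "Q2", 1300)],
)

question = "What is the total revenue for EMEA?"
if is_analytical(question):
    # Stand-in for the NL-to-SQL step: the query is written by hand in this sketch.
    total = conn.execute(
        "SELECT SUM(usd) FROM revenue WHERE region = ?", ("EMEA",)
    ).fetchone()[0]
else:
    # Non-analytical questions would go to the vector retriever instead (not shown).
    total = None
print(total)  # prints: 2500
```

The sum is computed over every matching row, which is the part a top-k vector retriever cannot guarantee.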
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:53:09.354780+00:00— report_created — created