Report #3097
[architecture] Tables from PDFs and spreadsheets become unreadable flattened text in my RAG chunks.
Extract tables as structured HTML (or Markdown) with headers and row/column context preserved; chunk them with repeated headers or summarize them, and never embed raw CSV-like strings without column names.
Journey Context:
Flattening a table into 'value1 value2 value3' destroys which value belongs to which column and loses units and dates. Unstructured exposes metadata.text_as_html for table elements, which preserves the two-dimensional structure; similar parsers offer comparable structured output. For wide or long tables, embedding the whole table often exceeds the embedding model's input limit, while embedding isolated rows without headers is meaningless because nothing says what each value is. The pragmatic middle ground is to chunk rows in groups, prepend the header to each group, and optionally add a one-sentence summary of the table's purpose. For spreadsheets, also include the sheet name and source metadata in each chunk.
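The row-grouping approach described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the table has already been parsed into a header list and a list of rows (for example from the parser's HTML output). The function name chunk_table, its parameters, and the "Source:" / "Summary:" prefix lines are hypothetical choices made for this example.

```python
from typing import List, Optional, Sequence


def chunk_table(
    header: Sequence[str],
    rows: Sequence[Sequence[str]],
    rows_per_chunk: int = 20,
    summary: Optional[str] = None,
    source: Optional[str] = None,
) -> List[str]:
    """Split a table into Markdown chunks that each repeat the header row.

    Hypothetical helper: `summary` is an optional one-sentence description
    of the table, `source` an optional label such as file and sheet name.
    """
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")

    def md_row(cells: Sequence[str]) -> str:
        # Escape pipes so cell text cannot break the Markdown table layout.
        escaped = (str(cell).replace("|", "\\|") for cell in cells)
        return "| " + " | ".join(escaped) + " |"

    # Header plus separator line, repeated at the top of every chunk.
    header_lines = [md_row(header), md_row(["---"] * len(header))]

    # Context lines that travel with every chunk, not just the first one.
    prefix: List[str] = []
    if source:
        prefix.append(f"Source: {source}")
    if summary:
        prefix.append(f"Summary: {summary}")

    chunks: List[str] = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = prefix + header_lines + [md_row(row) for row in group]
        chunks.append("\n".join(lines))
    return chunks
```

Called with a header such as ["Quarter", "Revenue (USD m)"], a handful of rows, and rows_per_chunk=2, each returned string is a small self-describing Markdown table that can be embedded on its own. The right group size depends on the column count and the embedding model's input limit, so the default of 20 is only a placeholder.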
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:29:36.914302+00:00— report_created — created