Report #4992
[architecture] What is the right way to include tables and structured data in a RAG corpus?
Convert tables to structured text (Markdown/HTML) and index each row as a separate chunk together with its surrounding context (column headers and caption), or convert rows to pseudo-sentences. Do not flatten a whole table into one blob, and do not drop headers. For complex multi-page tables, add a table summary chunk plus per-row chunks with cross-references between them.
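A minimal sketch of the summary-plus-rows layout described above, assuming the table has already been extracted into a caption, a header list, and a list of rows. The `Chunk` class, the `table_to_chunks` helper, the ID scheme, and the sample data are all hypothetical and only illustrate the shape of the output; adapt them to whatever schema your vector store expects.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical chunk schema: an ID, the text to embed, and free-form metadata.
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def table_to_chunks(table_id, caption, headers, rows):
    """Build one summary chunk plus one chunk per row, cross-referenced by ID."""
    summary_id = f"{table_id}:summary"
    row_ids = [f"{table_id}:row:{i}" for i in range(len(rows))]

    # The summary chunk names the table and its columns and points at every row.
    summary = Chunk(
        chunk_id=summary_id,
        text=f"Table: {caption}. Columns: {', '.join(headers)}. Rows: {len(rows)}.",
        metadata={"kind": "table_summary", "row_ids": row_ids},
    )

    chunks = [summary]
    for row_id, row in zip(row_ids, rows):
        # Repeat the caption and headers in every row chunk so it stands alone.
        cells = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append(
            Chunk(
                chunk_id=row_id,
                text=f"Table: {caption}. {cells}",
                metadata={"kind": "table_row", "summary_id": summary_id},
            )
        )
    return chunks


# Made-up sample data, for illustration only.
chunks = table_to_chunks(
    table_id="t1",
    caption="Quarterly revenue by region",
    headers=["Quarter", "Region", "Revenue (USD M)"],
    rows=[["Q2", "EMEA", "10.5"], ["Q3", "EMEA", "12.4"]],
)
```

Each row chunk carries `summary_id` in its metadata, so a retriever that hits a single row can also pull the summary chunk when a question spans several rows.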
Journey Context:
Naive PDF-to-text pipelines turn tables into unreadable whitespace soup, destroying the link between each number and its row and column labels. The opposite mistake is embedding an entire 50-row table as one chunk: it dominates the context window and can exceed the embedding model's input limit. The robust pattern is row-centric retrieval: each row becomes a chunk prefixed by the column headers and a short table caption, so a query about 'Q3 revenue' matches the exact row. If relationships across rows matter, keep a lightweight summary chunk that names the table and its columns. CSV/Excel pipelines benefit from converting rows to templated sentences (e.g., 'In 2023, product X had revenue Y in region Z') because embedding models tend to encode sentence-like text better than raw delimited rows.
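The templated-sentence conversion for CSV data can be sketched as follows, using only the standard library. The `TEMPLATE` string, the `rows_to_sentences` helper, and the column names are assumptions chosen to mirror the example sentence above; the sketch also assumes a well-formed CSV whose header row contains every field the template names.

```python
import csv
import io

# Hypothetical template; the placeholders must match the CSV column names.
TEMPLATE = "In {year}, product {product} had revenue {revenue} in region {region}."


def rows_to_sentences(csv_text, template=TEMPLATE):
    """Turn each CSV row into one sentence-like string suitable for embedding."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row is a dict keyed by header name, so it can fill the template directly.
    return [template.format(**row) for row in reader]


# Placeholder values taken from the example sentence, not real data.
sample = "year,product,revenue,region\n2023,X,Y,Z\n"
sentences = rows_to_sentences(sample)
```

A missing column raises `KeyError` here, which is usually preferable to silently embedding a sentence with a blank field.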
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:28:20.577901+00:00— report_created — created