Report #1828
[architecture] Tables in RAG are mangled by markdown linearization and retrieved as weak text chunks
Treat tables as structured retrieval units: extract rows (and optionally cells) as separate documents, embed each row with surrounding column context and a synthetic caption, and use a multi-vector retriever so hits on rows or cells can return the full table. Add column names and data types to metadata for filtering.
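A minimal sketch of the row-extraction step, assuming the table has already been parsed into a column list and a list of rows. The names `RowDoc` and `table_to_row_docs`, the `"caption | col: value"` text layout, and the metadata keys are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass
class RowDoc:
    """One retrievable unit: the text to embed plus filterable metadata."""
    text: str
    metadata: dict


def table_to_row_docs(table_id, caption, columns, dtypes, rows):
    """Turn a parsed table into one document per row.

    Each row's text carries the synthetic caption and the column names,
    so the embedding sees the row in context rather than bare values.
    """
    docs = []
    for i, row in enumerate(rows):
        # Pair every value with its column name to keep column context.
        pairs = "; ".join(f"{col}: {val}" for col, val in zip(columns, row))
        docs.append(
            RowDoc(
                text=f"{caption} | {pairs}",
                metadata={
                    "table_id": table_id,  # link back to the full table
                    "row_index": i,
                    "columns": list(columns),
                    "dtypes": dict(zip(columns, dtypes)),
                },
            )
        )
    return docs


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    for doc in table_to_row_docs(
        "t1",
        "toxicological reference values by chemical",
        ["chemical", "reference_value"],
        ["str", "float"],
        [["benzene", 0.004], ["toluene", 0.08]],
    ):
        print(doc.text, doc.metadata["row_index"])
```

The `table_id` in each row's metadata is what later lets a row-level hit be resolved to the whole table; `columns` and `dtypes` are there for metadata filtering.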
Journey Context:
Flattening a table into a markdown block loses row/column boundaries and produces semantically diluted embeddings, so a query about one row can retrieve unrelated rows from the same table. Row-level indexing gives each record a clean embedding surface. Cell-level indexing adds precision at the cost of more vectors and storage. Synthetic captions (e.g. 'this table shows toxicological reference values by chemical') improve cross-modal alignment between natural-language queries and tabular content. Multi-vector retrievers let the system return the full table when any row or cell matches. The cost is more embeddings to compute and store, plus a document store that maps each vector back to its parent table. Use this when tables contain the answers; simple narrative tables may not need it.
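The multi-vector pattern can be sketched in a few lines: many row-level vectors point at one parent table held in a document store, and a hit on any row returns the full table. This is a toy illustration under stated assumptions. `toy_embed` is a bag-of-words stand-in for a real embedding model, and `ParentTableRetriever` is a hypothetical in-memory class, not the API of any retrieval library.

```python
import math
from collections import Counter


def toy_embed(text):
    # Stand-in for a real embedding model: lowercase bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class ParentTableRetriever:
    """Index one vector per row; return the parent table on a row hit."""

    def __init__(self):
        self.vectors = []   # (embedding, table_id, row_index)
        self.docstore = {}  # table_id -> full table

    def add_table(self, table_id, caption, columns, rows):
        self.docstore[table_id] = {
            "caption": caption, "columns": columns, "rows": rows,
        }
        for i, row in enumerate(rows):
            # Embed caption + column names + values for each row.
            pairs = " ".join(f"{c} {v}" for c, v in zip(columns, row))
            self.vectors.append((toy_embed(f"{caption} {pairs}"), table_id, i))

    def retrieve(self, query, k=1):
        q = toy_embed(query)
        ranked = sorted(self.vectors, key=lambda v: cosine(q, v[0]), reverse=True)
        results, seen = [], set()
        for _, table_id, row_index in ranked:
            if table_id in seen:
                continue  # return each parent table at most once
            seen.add(table_id)
            results.append(
                {"table": self.docstore[table_id], "matched_row": row_index}
            )
            if len(results) == k:
                break
        return results


if __name__ == "__main__":
    # Hypothetical example tables, for illustration only.
    r = ParentTableRetriever()
    r.add_table("tox", "toxicological reference values by chemical",
                ["chemical", "reference_value"],
                [["benzene", 0.004], ["toluene", 0.08]])
    r.add_table("rev", "quarterly revenue by region",
                ["region", "revenue"], [["emea", 120], ["apac", 95]])
    hit = r.retrieve("reference values benzene")[0]
    print(hit["matched_row"], hit["table"]["rows"])
```

Deduplicating by `table_id` keeps several matching rows from the same table from crowding out other tables in the top `k`. In a real deployment the vector list and the dictionary would be replaced by a vector index and a persistent document store.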
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:47:46.763334+00:00— report_created — created