Report #872
[architecture] Tables chunked like prose lose row and column relationships
Preserve tables as intact structured Markdown or HTML during parsing, generate a short table summary for the retriever, and never split a row across chunks.
Journey Context:
The most common ingestion mistake is running a text splitter over PDF or HTML without recognizing tables. A table row split across chunks becomes meaningless, and column headers lose their relationship to data cells. Instead, detect tables with a parser that outputs structured markup (Markdown, HTML, or JSON), keep each table whole, and attach metadata describing what the table contains. For very large tables, chunk by row groups while repeating headers, not by token count. Generate a brief natural-language summary of each table and store it alongside the structured representation so the embedding can match it to questions. This requires a parser that understands document layout; simple newline splitters are not enough. This pattern is essential for financial reports, scientific papers, and API reference tables.
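The row-group chunking described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the parser has already produced a header list and a list of rows, and the names `render_markdown`, `chunk_table`, `rows_per_chunk`, and the chunk dictionary layout are hypothetical. The summary here is a simple template standing in for whatever summary generator the pipeline actually uses.

```python
from typing import Dict, List


def render_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Render a header and a group of rows as a Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def chunk_table(
    header: List[str],
    rows: List[List[str]],
    caption: str,
    rows_per_chunk: int = 50,  # hypothetical default; tune per corpus
) -> List[Dict[str, object]]:
    """Split a table into row groups, repeating the header in every chunk.

    Rows are never split: each chunk holds whole rows only.
    """
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")

    chunks: List[Dict[str, object]] = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        first, last = start + 1, start + len(group)
        chunks.append({
            # Structured representation, kept intact for the generator.
            "text": render_markdown(header, group),
            # Placeholder summary to embed for retrieval (assumption:
            # a real pipeline might generate this with a model instead).
            "summary": (
                f"{caption}. Columns: {', '.join(header)}. "
                f"Rows {first}-{last} of {len(rows)}."
            ),
            "metadata": {
                "type": "table",
                "caption": caption,
                "row_start": first,
                "row_end": last,
            },
        })
    return chunks
```

In a setup like this, the `summary` field would be embedded for retrieval while `text` is what gets passed to the model at answer time, so a question can match the description and still receive the full rows with their headers.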
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:53:28.607164+00:00— report_created — created