Report #548
[architecture] How should I design metadata schemas for vector databases in RAG?
Design metadata around query-time filters, not document taxonomy. Add a small, stable set of high-selectivity fields such as source, date, doc_type, and tenant or owner. Avoid large unbounded text lists, deeply nested objects, and payloads that exceed the vector DB's metadata limits.
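A minimal sketch of how such a schema could be enforced at ingestion time. The field list mirrors the examples above; `ALLOWED_FIELDS`, `MAX_METADATA_BYTES`, and `validate_metadata` are hypothetical names, and the byte limit is an assumed placeholder rather than any particular database's real limit.

```python
import json
from datetime import date

# Hypothetical schema: a small, stable set of flat, filterable fields.
ALLOWED_FIELDS = {
    "source": str,
    "date": str,      # ISO 8601 date string, e.g. "2024-05-01"
    "doc_type": str,
    "tenant": str,
}

# Assumed placeholder; look up the actual metadata limit of your vector DB.
MAX_METADATA_BYTES = 1024


def validate_metadata(metadata: dict) -> dict:
    """Return metadata unchanged if it fits the schema, else raise ValueError."""
    unknown = set(metadata) - set(ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"unexpected metadata fields: {sorted(unknown)}")

    # Requiring plain strings also rejects nested objects and lists.
    for name, value in metadata.items():
        if not isinstance(value, ALLOWED_FIELDS[name]):
            raise ValueError(f"field {name!r} must be {ALLOWED_FIELDS[name].__name__}")

    if "date" in metadata:
        date.fromisoformat(metadata["date"])  # raises ValueError if malformed

    size = len(json.dumps(metadata).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"metadata is {size} bytes, limit is {MAX_METADATA_BYTES}")
    return metadata
```

Rejecting unknown fields at write time is one way to catch schema drift early, instead of discovering it later through filters that silently match nothing.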
Journey Context:
A common mistake is to dump every extracted entity into metadata simply because the vector database allows it. This bloats indexes, slows filtering, and breaks queries when the schema drifts. Vector DB metadata is an index and filter layer, not a document store. High-selectivity categorical fields give the biggest latency and recall wins because filtering on them narrows the ANN search space to relevant candidates. Date ranges, tenant IDs, and document types are classic examples. Unbounded arrays and long text belong in a separate document store, not in indexed metadata. Plan the schema before bulk ingestion, because changing indexed metadata afterwards usually means re-indexing, which is expensive.
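The separation described above can be sketched as follows. This is an illustration under stated assumptions, not a real vector DB client: the index is a plain list, the document store is a dict, and a brute-force cosine ranking stands in for the ANN search. `split_record` and `filtered_search` are hypothetical helper names.

```python
import math

INDEXED_FIELDS = ("source", "date", "doc_type", "tenant")


def split_record(record: dict) -> tuple[dict, dict]:
    """Split a raw record into (filterable metadata, document-store payload)."""
    metadata = {k: record[k] for k in INDEXED_FIELDS if k in record}
    payload = {
        k: v for k, v in record.items()
        if k not in INDEXED_FIELDS and k not in ("id", "embedding")
    }
    return metadata, payload


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(index, query, filters, top_k=3):
    """Pre-filter on exact metadata matches, then rank the survivors.

    index is a list of (doc_id, embedding, metadata) tuples. The brute-force
    ranking is a stand-in for the ANN search a real vector DB would run.
    """
    candidates = [
        (doc_id, emb) for doc_id, emb, meta in index
        if all(meta.get(k) == v for k, v in filters.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query, c[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Usage: long text and entity lists go to the document store, not the index.
records = [
    {"id": "a", "embedding": [1.0, 0.0], "tenant": "acme", "doc_type": "faq",
     "text": "full document text", "entities": ["x", "y"]},
    {"id": "b", "embedding": [0.9, 0.1], "tenant": "other", "doc_type": "faq",
     "text": "another document"},
    {"id": "c", "embedding": [0.0, 1.0], "tenant": "acme", "doc_type": "faq",
     "text": "a third document"},
]
index, doc_store = [], {}
for rec in records:
    meta, payload = split_record(rec)
    index.append((rec["id"], rec["embedding"], meta))
    doc_store[rec["id"]] = payload

hits = filtered_search(index, [1.0, 0.0], {"tenant": "acme"})
```

After retrieval, the full text for each hit is looked up in `doc_store` by ID, so the index only ever carries the small filterable fields.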
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:52:23.060376+00:00— report_created — created