Report #7659
[architecture] Low recall in vector similarity search when combined with metadata filters
Choose the filtering strategy by selectivity. For highly selective filters on a fixed set of values (rare categories), use partial IVFFlat indexes in pgvector (WHERE category = 'X') or filtered HNSW in specialized databases (Weaviate, Milvus). When the filtered subset is small enough to scan directly, skip the vector index entirely: query the metadata B-tree index first, then compute exact vector distances on the filtered subset, since ANN overhead outweighs its benefit on small sets. For low-selectivity filters (common values that match most rows), post-filtering removes few candidates, so ANN search with a modest over-fetch is usually sufficient.
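As a rough illustration of the decision above, the sketch below picks a strategy from the estimated size of the filtered subset. The function name `choose_strategy`, the returned labels, and the `exact_scan_limit` default of 50,000 rows are all hypothetical placeholders, not values from this report or from any database; the right cutoff depends on vector dimensionality and hardware and should be measured.

```python
def choose_strategy(total_rows, match_fraction, exact_scan_limit=50_000):
    """Pick a filtered-search strategy from a rough selectivity estimate.

    exact_scan_limit is an illustrative placeholder, not a measured
    threshold; benchmark an exact scan on your own data to set it.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if not 0.0 <= match_fraction <= 1.0:
        raise ValueError("match_fraction must be between 0 and 1")

    # Estimated number of rows that survive the metadata filter.
    matching_rows = round(total_rows * match_fraction)

    if matching_rows <= exact_scan_limit:
        # Small subset: B-tree lookup, then exact distances, no ANN index.
        return "exact_scan_on_filtered_subset"
    # Large subset: rely on a partial index or a natively filtered ANN index.
    return "filtered_ann_index"


print(choose_strategy(1_000_000, 0.01))  # 10,000 matching rows
print(choose_strategy(1_000_000, 0.80))  # 800,000 matching rows
```

The match fraction would typically come from table statistics or a cheap count on the metadata index; the sketch assumes it is already known.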
Journey Context:
Developers store embeddings in vector stores (pgvector, Pinecone) and need to filter by metadata (e.g., 'products similar to X but in category Y'). The naive implementation runs an ANN (HNSW/IVFFlat) vector search to get the top K, then applies the metadata filter (post-filtering). If the filter is selective (e.g., only 1% of rows match), the ANN search spends its effort on the 99% of vectors that cannot match and, worse, may return fewer than K results because the filter removes most candidates. Over-fetching does not reliably fix this: fetch 100 candidates hoping to keep 10, and a filter that removes 99% leaves about 1. The hard-won insight is that ANN indexes (HNSW, IVFFlat) are not designed for conjunctive filtering on arbitrary metadata. The solution depends on filter selectivity. For highly selective static filters, partial indexes (in pgvector) or native filtered ANN (in Weaviate/Milvus) let the vector search scan only the relevant partition. When the filtered subset is small, the vector index is counterproductive: it is faster to use a B-tree index on the metadata to get the candidate set, then compute exact vector distances (no ANN approximation) on that small set, which gives perfect recall without index overhead. For low-selectivity filters (e.g., 'status=active' where 80% of rows are active), post-filtering removes few candidates, so ANN search with a modest over-fetch (roughly K divided by the match fraction) is usually sufficient.
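The shortfall described above can be reproduced with a toy corpus. This is a minimal sketch in plain Python, not pgvector or any vector database: the data, the function names `post_filter` and `pre_filter_exact`, and the one-dimensional "embeddings" are made up for illustration, and an exact top-N stands in for the ANN step (the best case, with no approximation loss).

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical corpus: 1,000 items, 1% of them in the category "rare".
items = [
    {"id": i, "category": "rare" if i % 100 == 0 else "common", "vec": (float(i),)}
    for i in range(1000)
]
query = (0.0,)


def post_filter(items, query, category, fetch_k):
    """Fetch the nearest fetch_k items first, then apply the metadata filter."""
    candidates = heapq.nsmallest(fetch_k, items, key=lambda it: l2(it["vec"], query))
    return [it["id"] for it in candidates if it["category"] == category]


def pre_filter_exact(items, query, category, k):
    """Filter by metadata first, then rank the subset by exact distance."""
    subset = [it for it in items if it["category"] == category]
    nearest = heapq.nsmallest(k, subset, key=lambda it: l2(it["vec"], query))
    return [it["id"] for it in nearest]


post = post_filter(items, query, "rare", fetch_k=100)
pre = pre_filter_exact(items, query, "rare", k=10)
print(post)  # [0]
print(pre)   # [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
```

Fetching 100 candidates leaves a single match after the filter, while filtering first and scanning the ten matching rows exactly returns all ten in distance order. In a real database the list comprehension in `pre_filter_exact` would correspond to the B-tree lookup on the metadata column.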
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:20:56.819473+00:00— report_created — created