Report #65319
[architecture] Using approximate nearest neighbor (ANN) vector search when exact brute force is sufficient, or vice versa, leading to recall issues or wasted resources
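For datasets below roughly 100k to 1M vectors (the exact cutoff depends on dimensionality and latency budget), use exact k-nearest-neighbor (KNN) search, for example with `pgvector` (PostgreSQL) or an in-memory FAISS flat index. Adopt ANN indexes (HNSW, IVF) only when the dataset exceeds that range or exact search can no longer meet strict latency requirements on a massive dataset, and accept the recall and complexity tradeoff that comes with them.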
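As a rough illustration of what exact search means here, the sketch below scans every stored vector and keeps the k closest by cosine distance. It is written in pure Python for readability; the names `cosine_distance` and `exact_knn` are hypothetical, and a real deployment would more likely lean on `pgvector` or a FAISS flat index with vectorized distance kernels.

```python
import heapq
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as orthogonal to everything.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Scan every vector and return the k nearest as (index, distance) pairs."""
    scored = ((i, cosine_distance(query, v)) for i, v in enumerate(vectors))
    # nsmallest keeps only k candidates in memory while scanning all vectors.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    catalog = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(exact_knn([1.0, 0.0], catalog, k=2))
```

Because every vector is compared against the query, the result is the true top k by construction; the cost is one distance computation per stored vector per query.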
Journey Context:
Developers often default to HNSW (Hierarchical Navigable Small World) indexes for all vector search because it is seen as the 'best' algorithm, but HNSW carries significant build-time and memory overhead. For small-to-medium datasets (under 1M vectors), exact KNN with a proper distance function (L2, cosine) and vectorized (SIMD) operations often answers in under 100 ms and provides 100% recall. ANN introduces a recall@K tradeoff (it returns only approximate neighbors) and index maintenance costs that are unnecessary for small catalogs. Conversely, brute-force search over 100M vectors tends to time out. The decision boundary depends on vector dimensionality (higher dimensionality makes search more expensive) and on latency requirements.
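One way to make the recall@K tradeoff concrete is to compare the IDs an ANN index returns against the IDs from an exact search over the same query. The sketch below assumes both result lists are already available as ranked lists of IDs; the function name `recall_at_k` and the sample IDs are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors that the approximate top-k found."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    hits = len(exact_top & set(approx_ids[:k]))
    return hits / len(exact_top)


if __name__ == "__main__":
    exact = [11, 42, 7, 93]   # ground truth from a brute-force scan
    approx = [11, 42, 7, 58]  # illustrative ANN output that missed one neighbor
    print(recall_at_k(approx, exact, k=4))  # 0.75
```

Averaging this value over a sample of representative queries gives a simple check on whether an ANN index is losing more neighbors than the application can tolerate.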
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:07:10.242648+00:00— report_created — created