Report #29491
[architecture] Low accuracy (missing relevant results) when using pgvector HNSW or IVFFlat indexes for similarity search
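Tune index parameters for a target recall@k. For IVFFlat, raise the number of lists scanned per query (for example, SET ivfflat.probes = 50; values in the 50-100 range are a starting point). For HNSW, raise the size of the candidate list explored during search (for example, SET hnsw.ef_search = 100; try values in the 100-200 range). Always benchmark against ground-truth results from an exact search. When 100% recall is required, use exact search with LIMIT on a filtered subset. Reranking the top-N ANN candidates by exact distance corrects the ordering among those candidates, but it cannot recover neighbors the index never returned, so it helps only when N is well above k.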
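The benchmarking step can be sketched as follows. This is a minimal illustration, not part of the original report: `recall_at_k`, `smallest_setting_for_recall`, and the `run_ann` callback are hypothetical names, and `run_ann` is assumed to wrap whatever query the application runs after setting ivfflat.probes or hnsw.ef_search to the given value. The ground truth is assumed to come from the same queries run as exact search.

```python
from typing import Callable, Hashable, Iterable, Optional, Sequence, Tuple


def recall_at_k(approx_ids: Sequence[Hashable],
                exact_ids: Sequence[Hashable],
                k: int) -> float:
    """Fraction of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    return len(truth & set(approx_ids[:k])) / len(truth)


def smallest_setting_for_recall(
    settings: Iterable[int],
    run_ann: Callable[[int, int], Sequence[Hashable]],  # hypothetical: (query_index, setting) -> ids
    ground_truth: Sequence[Sequence[Hashable]],         # exact top-k ids, one list per query
    k: int,
    target: float,
) -> Optional[Tuple[int, float]]:
    """Return (setting, mean recall) for the lowest setting that meets the target, else None."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one query")
    for setting in sorted(settings):
        scores = [
            recall_at_k(run_ann(q, setting), truth, k)
            for q, truth in enumerate(ground_truth)
        ]
        mean_recall = sum(scores) / len(scores)
        if mean_recall >= target:
            return setting, mean_recall
    return None
```

Trying settings in ascending order keeps latency as low as the recall target allows. A return value of None means no tested setting reached the target, which is a signal to widen the sweep or fall back to exact search.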
Journey Context:
Vector databases use Approximate Nearest Neighbor (ANN) indexes (IVFFlat, HNSW) to avoid brute-force O(N) distance calculations. IVFFlat partitions the vectors into lists and scans only the lists closest to the query; HNSW walks a layered proximity graph and visits only a bounded set of candidates. Both search only promising regions, trading recall for speed. The default parameters (ivfflat.probes = 1, hnsw.ef_search = 40) prioritize latency over accuracy, causing 'missing results' bugs that are hard to detect without ground-truth testing. The relationship between probes/ef_search and recall is dataset-dependent, so values tuned on one dataset may not carry over to another; high-dimensional sparse vectors may need more aggressive probing than dense embeddings. For financial or safety-critical retrieval, use ANN only for candidate generation (first-stage retrieval), followed by exact distance calculation on the top-K results.
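The two-stage pattern described above might look like the sketch below. It assumes the first stage has already returned an over-fetched list of candidate ids, and that the full-precision vectors are available in memory as a dict; `rerank_exact`, `l2_distance`, and the data layout are illustrative choices, not something the report specifies. Euclidean distance is used here only as an example metric.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rerank_exact(
    query: Sequence[float],
    candidate_ids: Sequence[Hashable],            # over-fetched ids from the ANN stage
    vectors: Dict[Hashable, Sequence[float]],     # full-precision vectors keyed by id
    k: int,
) -> List[Tuple[Hashable, float]]:
    """Re-score every candidate with the exact distance and keep the k closest."""
    scored = [(l2_distance(query, vectors[cid]), cid) for cid in candidate_ids]
    scored.sort(key=lambda pair: pair[0])
    return [(cid, dist) for dist, cid in scored[:k]]
```

The rerank step can only reorder what the first stage supplied, so the candidate list should be noticeably larger than k. It is likely most useful when first-stage scores are themselves approximate, for example when the index is built over quantized vectors.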
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:53:33.350625+00:00— report_created — created