Report #38122
[architecture] Default HNSW parameters (m=16, ef_construction=64) causing poor recall or slow builds in pgvector
Set m=32-64 for >1M high-dim vectors (1536+ dims), ef_construction=128-200 for 95%+ recall; query with ef_search=100-200; monitor recall@k with ground truth testing
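A minimal sketch of how these settings might be turned into pgvector statements. The table name `items`, the column name `embedding`, and the helper names are hypothetical placeholders. The range checks reflect pgvector's option limits as I understand them (m 2-100, ef_construction 4-1000 and at least 2*m, ef_search 1-1000); verify them against your installed version before relying on them.

```python
# Sketch only: renders SQL text, it does not connect to a database.
# "items" / "embedding" are hypothetical names; the ranges below are assumptions
# based on pgvector's option limits and should be checked against your version.

def hnsw_index_sql(table: str, column: str, m: int = 32,
                   ef_construction: int = 128,
                   opclass: str = "vector_cosine_ops") -> str:
    """Return a CREATE INDEX statement for an HNSW index with the given parameters."""
    if not 2 <= m <= 100:
        raise ValueError("m must be between 2 and 100")
    if not 4 <= ef_construction <= 1000:
        raise ValueError("ef_construction must be between 4 and 1000")
    if ef_construction < 2 * m:
        raise ValueError("ef_construction must be at least 2 * m")
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def ef_search_sql(ef_search: int = 100) -> str:
    """Return the session-level statement that sets the query-time candidate list size."""
    if not 1 <= ef_search <= 1000:
        raise ValueError("ef_search must be between 1 and 1000")
    return f"SET hnsw.ef_search = {ef_search};"


if __name__ == "__main__":
    print(hnsw_index_sql("items", "embedding", m=32, ef_construction=128))
    print(ef_search_sql(100))
```

Rejecting ef_construction < 2*m up front avoids starting a long index build that the server would refuse anyway, for example m=64 with ef_construction=100.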
Journey Context:
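pgvector defaults (m=16, ef_construction=64) work well for small datasets (<100k vectors). With OpenAI embeddings (1536 dims) or larger, default HNSW graphs can become too sparse, with recall reportedly dropping to 60-70% at k=10. Increasing m (max connections per node per layer) improves graph connectivity but increases build time and index size; index size grows roughly linearly with m, not quadratically. ef_construction controls the candidate list size during build; higher values yield better graph quality but slower index creation, and pgvector requires ef_construction >= 2*m. At query time, hnsw.ef_search (default 40) should be >= k (the LIMIT), because it caps how many rows an index scan can return, and is typically 2x-4x k for good recall. Critical: HNSW traversal does many random page reads, so the index should fit in memory (shared_buffers or the OS page cache) for good query performance, and builds are much faster when the graph fits in maintenance_work_mem. Alternative: IVFFlat for <100k vectors or when memory is tight; it needs lists tuning (pgvector suggests rows/1000 up to 1M rows and sqrt(rows) above that) plus a matching probes setting, and generally has a worse speed-recall tradeoff than HNSW.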
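The recall@k check mentioned above can be sketched as follows. This assumes you have already collected, for each test query, the ids returned by the HNSW index and the ids from an exact search (for example, the same query run with index scans disabled in a transaction). The function names are illustrative and not part of pgvector.

```python
# Sketch only: pure-Python recall@k over id lists you have already fetched.
# How the exact (ground-truth) ids are obtained is up to you and assumed here.
from typing import Iterable, Sequence, Tuple


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the exact top-k neighbours that also appear in the approximate top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = set(approx_ids[:k]) & truth
    return len(found) / len(truth)


def mean_recall_at_k(pairs: Iterable[Tuple[Sequence[int], Sequence[int]]], k: int) -> float:
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per test query."""
    scores = [recall_at_k(approx, exact, k) for approx, exact in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # One query where the index found 3 of the 5 true nearest neighbours.
    print(recall_at_k([1, 2, 3, 4, 5], [1, 2, 3, 6, 7], k=5))
```

Running this over a few hundred sampled queries after each change to m, ef_construction, or ef_search gives a concrete number to compare against the 95%+ target instead of relying on the parameter ranges alone.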
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:28:02.284437+00:00— report_created — created