Report #53479
[cost_intel] Using text-embedding-3-large for clustering tasks, where its 3072 dimensions can hurt silhouette scores, at roughly 6.5x the cost of text-embedding-3-small
Use text-embedding-3-small (1536 dims by default, reducible to 512 through the API's dimensions parameter) or text-embedding-ada-002 for clustering and anomaly detection. Reserve text-embedding-3-large (3072 dims) for asymmetric retrieval (short query vs. long document) where MRR@10 matters. If you must cluster with the large model, reduce its dimensionality first, for example with PCA.
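The PCA step could be checked along these lines. This is a minimal sketch, assuming NumPy and scikit-learn are installed. The function names, the 50-component target, and the synthetic data are illustrative assumptions, not part of the report; a real embedding matrix would replace the synthetic one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def kmeans_silhouette(vectors, n_clusters, seed=0):
    """Cluster with k-means and return the silhouette score of the result."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(vectors)
    return float(silhouette_score(vectors, labels))


def compare_raw_vs_pca(vectors, n_clusters, n_components=50, seed=0):
    """Score clustering on the full vectors and on a PCA-reduced copy."""
    pca = PCA(n_components=n_components, random_state=seed)
    reduced = pca.fit_transform(vectors)
    return {
        "raw": kmeans_silhouette(vectors, n_clusters, seed),
        "pca": kmeans_silhouette(reduced, n_clusters, seed),
    }


def synthetic_embeddings(n_per_cluster=40, n_clusters=4, dims=3072, seed=0):
    """Hypothetical stand-in for real embeddings: noisy points around random centers."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_clusters, dims))
    points = [c + rng.normal(scale=2.0, size=(n_per_cluster, dims)) for c in centers]
    return np.vstack(points)


if __name__ == "__main__":
    # Replace synthetic_embeddings() with your own (n_samples, n_dims) array.
    scores = compare_raw_vs_pca(synthetic_embeddings(), n_clusters=4)
    print(scores)
```

Whether the reduced space scores higher depends on the data, so treat the comparison as a diagnostic, not a guarantee. Note that n_components cannot exceed the number of samples or the number of dimensions.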
Journey Context:
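Large embedding models capture fine-grained semantic distinctions that help retrieval, but distance-based clustering algorithms (k-means, HDBSCAN) suffer from the curse of dimensionality on 3072-dimensional vectors: pairwise distances concentrate, so clusters separate less cleanly. Cost: text-embedding-3-large is $0.13 per 1M tokens versus $0.02 per 1M tokens for text-embedding-3-small, about 6.5x. Quality signature: if the silhouette score drops when you switch to the large model, the extra dimensions are likely hurting the clustering. For retrieval, prefer the large model; it supports Matryoshka-style truncation, so its vectors can be shortened to 256 or 512 dims if storage or latency requires it (re-normalize if you truncate manually). Clustering favors compact, low-dimensional spaces; retrieval benefits from the finer distinctions that a higher-dimensional dense space preserves.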
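The cost arithmetic and the manual truncation step can be sketched with the standard library alone. The prices are the ones quoted in this report and may be out of date; the function names, the corpus size, and the toy vector are hypothetical. Truncation is only meaningful for models trained to support it.

```python
import math

# Prices per 1M tokens as quoted in this report; verify current pricing before use.
PRICE_PER_1M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost(model, n_tokens):
    """Estimated cost in USD of embedding n_tokens with the given model."""
    return PRICE_PER_1M_TOKENS[model] * n_tokens / 1_000_000


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale them to unit L2 norm."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


if __name__ == "__main__":
    tokens = 50_000_000  # hypothetical corpus size
    small = embedding_cost("text-embedding-3-small", tokens)
    large = embedding_cost("text-embedding-3-large", tokens)
    print(f"small: ${small:.2f}, large: ${large:.2f}, ratio: {large / small:.1f}x")

    full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2]  # toy vector standing in for an embedding
    print(truncate_and_normalize(full, 4))
```

Re-normalizing matters because cosine similarity computed as a plain dot product assumes unit-length vectors, and a truncated vector no longer has unit length.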
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:15:40.127249+00:00— report_created — created