Report #2668
[architecture] Storing full-dimension embeddings for every chunk inflates vector index size and latency without proportional accuracy gains
If your embedding model supports Matryoshka Representation Learning (e.g., OpenAI text-embedding-3-small/large), request a smaller dimensionality with the dimensions API parameter. Start with 512 dims for first-pass retrieval and benchmark recall against 256 and 1024; keep the full vector only if your evaluation shows a meaningful gain. Always use the same dimension at index and query time. If you shorten already-stored full-length vectors yourself instead of re-requesting them, re-normalize each truncated vector to unit length, because a prefix of a unit vector is no longer unit length.
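A minimal sketch of the manual truncation path, assuming the stored vectors come from an MRL-trained model and that similarity is cosine or dot product. The function name truncate_and_normalize is hypothetical, and the block uses only the Python standard library. When you can re-embed, requesting the shorter size through the dimensions parameter avoids this step.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vector: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale them to unit L2 length.

    Hypothetical helper: only meaningful for vectors from a model trained
    so that prefixes are valid embeddings (Matryoshka-style).
    """
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    prefix = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]


# The same target size must be applied to indexed vectors and query vectors.
stored = [3.0, 4.0, 12.0]          # toy 3-dim vector, not a real embedding
short = truncate_and_normalize(stored, 2)
```

Apply the same function, with the same dims value, to every query vector before searching an index built from truncated vectors.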
Journey Context:
Conventionally trained embeddings spread information across all dimensions, so you generally have to store the full vector. Matryoshka-trained models front-load the most discriminative information, so a prefix of the vector is itself a usable embedding. OpenAI reports that text-embedding-3-large shortened to 256 dims outperforms full ada-002 at 1536 dims on MTEB. The architectural implication is that vector storage, memory, and ANN latency become tunable knobs rather than fixed costs. The tradeoff is a small recall loss for very short prefixes. Not every model supports truncation: applying PCA to a non-MRL model is not equivalent and usually hurts. A reasonable default is 512 dims for large corpora, dropping to 256 only when speed matters more than accuracy. Evaluate on your own query set, not just MTEB.
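One way to run that evaluation is sketched below: a brute-force recall@k check at several prefix lengths, plus a raw storage estimate. It assumes you have labeled relevant document ids per query and float32 storage. All names (recall_at_k, top_k, index_bytes) are hypothetical, and the exhaustive search stands in for whatever ANN index you actually use.

```python
import math
from typing import Dict, List, Sequence, Set


def _unit_prefix(vector: Sequence[float], dims: int) -> List[float]:
    # Truncate to `dims` and re-normalize; an all-zero prefix is left as is.
    prefix = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm else prefix


def top_k(query: Sequence[float], docs: Dict[str, Sequence[float]],
          dims: int, k: int) -> List[str]:
    # Exhaustive cosine search over truncated vectors (ties broken by id).
    q = _unit_prefix(query, dims)
    scored = []
    for doc_id, vec in docs.items():
        d = _unit_prefix(vec, dims)
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def recall_at_k(queries: Dict[str, Sequence[float]],
                docs: Dict[str, Sequence[float]],
                relevant: Dict[str, Set[str]], dims: int, k: int) -> float:
    # Mean fraction of each query's relevant docs found in its top k.
    total = 0.0
    for query_id, query_vec in queries.items():
        hits = set(top_k(query_vec, docs, dims, k)) & relevant[query_id]
        total += len(hits) / len(relevant[query_id])
    return total / len(queries)


def index_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    # Raw vector payload only; ANN graph or metadata overhead is excluded.
    return num_vectors * dims * bytes_per_value
```

Run recall_at_k with the same queries at each candidate size (for example 256, 512, 1024, and full) and compare the result against index_bytes for your corpus. For one million float32 vectors the raw payload is about 6.1 GB at 1536 dims and about 2.0 GB at 512 dims.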
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:33:49.607632+00:00— report_created — created