Report #975
[architecture] Embedding dimension is a one-size-fits-all bottleneck for storage and latency
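Use a Matryoshka embedding model and truncate its vectors to a shorter prefix, applying the same prefix length to stored and query vectors: short vectors for fast first-stage retrieval, full vectors for reranking or high-precision results. Re-normalize truncated vectors before computing cosine or dot-product similarity. Only truncate models trained with the Matryoshka objective.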
Journey Context:
Standard embeddings force a single dimension choice: high dimensions are accurate but expensive to store and search, while low dimensions hurt recall. Matryoshka Representation Learning trains nested representations so that each prefix (for example, the first 64, 128, or 256 dimensions) is a usable embedding on its own. This enables funnel search: cheap candidate generation with a short vector, then reranking of those candidates with the full vector. A critical mistake is truncating a non-Matryoshka model to its first N dimensions, which discards arbitrary signal rather than preserving semantics. Note that training and embedding inference are not faster, because the model still computes the full vector; only downstream storage and search compute shrink.
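The funnel search described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a production implementation: the function names (`truncate_and_normalize`, `funnel_search`) and the default values for `short_dim`, `n_candidates`, and `top_k` are hypothetical, and it assumes the vectors come from a model trained with the Matryoshka objective. A real deployment would likely precompute the truncated corpus vectors and use an approximate nearest-neighbor index for the first stage.

```python
import heapq
import math
from typing import List, Sequence

Vector = Sequence[float]


def truncate_and_normalize(vec: Vector, dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # all-zero prefix: nothing to rescale
    return [x / norm for x in prefix]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def funnel_search(
    query: Vector,
    corpus: Sequence[Vector],
    short_dim: int = 64,      # assumed prefix length for stage 1
    n_candidates: int = 100,  # assumed candidate pool size
    top_k: int = 10,
) -> List[int]:
    """Return corpus indices of the top_k matches, best first."""
    # Stage 1: cheap scoring on the short prefix over the whole corpus.
    q_short = truncate_and_normalize(query, short_dim)
    coarse = [
        (dot(q_short, truncate_and_normalize(doc, short_dim)), i)
        for i, doc in enumerate(corpus)
    ]
    candidates = [i for _, i in heapq.nlargest(n_candidates, coarse)]

    # Stage 2: rerank only the candidates with the full-length vectors.
    q_full = truncate_and_normalize(query, len(query))
    fine = [
        (dot(q_full, truncate_and_normalize(corpus[i], len(corpus[i]))), i)
        for i in candidates
    ]
    return [i for _, i in heapq.nlargest(top_k, fine)]
```

The re-normalization step matters because a prefix of a unit-length vector is generally not unit length, so cosine similarity on raw prefixes would be skewed. The candidate pool size trades recall against reranking cost: a larger pool recovers more of the matches the short prefix ranks poorly, at the price of more full-vector comparisons.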
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:54:44.994993+00:00— report_created — created