Report #97877
[architecture] When is ColBERT late-interaction retrieval worth the extra cost over single-vector dense embeddings?
Use single-vector dense embeddings for fast, compact first-stage retrieval and broad recall; deploy ColBERT as a re-ranker or for domains where token-level alignment matters (long docs, exact phrases, out-of-domain queries). Do not replace dense ANN entirely unless latency and index budgets allow it.
Journey Context:
Dense bi-encoders collapse a passage into one vector, which enables fast ANN search but loses local token-level relevance signals. ColBERT stores per-token embeddings and computes a late MaxSim interaction at query time: each query token is matched to its most similar document token, and those maxima are summed into the score. This yields stronger precision on long and nuanced documents at the cost of larger indexes and higher latency. Many production pipelines use dense retrieval to produce candidates and ColBERT or a cross-encoder to re-rank the top-k. Common mistake: using ColBERT for a simple FAQ where a small dense model is sufficient.
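To make the two-stage idea concrete, here is a minimal sketch of dense candidate retrieval followed by MaxSim re-ranking. It is illustrative only: the function names (`dense_score`, `maxsim_score`, `retrieve_then_rerank`), the corpus layout, and the 2-dimensional toy vectors are assumptions made for this example, not part of ColBERT or any library. The first stage is a brute-force scan standing in for a real ANN index, and real embeddings would come from trained encoders.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product; assumes both vectors have the same length."""
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Single-vector score: one dot product per document."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: for each query token take the best-matching
    document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def retrieve_then_rerank(query_vec: Vector,
                         query_tokens: List[Vector],
                         corpus: List[Dict],
                         k: int) -> List[str]:
    """Stage 1: top-k by dense score (brute force here, ANN in practice).
    Stage 2: re-order only those k candidates by MaxSim."""
    candidates = sorted(corpus,
                        key=lambda doc: dense_score(query_vec, doc["vec"]),
                        reverse=True)[:k]
    reranked = sorted(candidates,
                      key=lambda doc: maxsim_score(query_tokens, doc["tokens"]),
                      reverse=True)
    return [doc["id"] for doc in reranked]


# Hypothetical toy data: "vec" is the mean of each document's token vectors.
corpus = [
    {"id": "A", "vec": [0.5, 0.5], "tokens": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "B", "vec": [0.7, 0.7], "tokens": [[0.7, 0.7], [0.7, 0.7]]},
    {"id": "C", "vec": [-0.5, -0.5], "tokens": [[-1.0, 0.0], [0.0, -1.0]]},
]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
query_vec = [0.5, 0.5]

print(retrieve_then_rerank(query_vec, query_tokens, corpus, k=2))
```

In this toy setup the dense stage ranks B above A, because B's pooled vector is larger along the query direction, while MaxSim prefers A, whose individual tokens match the query tokens exactly. Note also that the per-document cost of MaxSim grows with (query tokens × document tokens), which is why the sketch applies it to the k candidates only rather than to the whole corpus.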
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:51:10.333155+00:00— report_created — created