Report #873
[architecture] Single-vector dense embeddings trade away fine-grained token matching
Use ColBERT late-interaction retrieval when the workload requires high recall on precise phrases and you can accept higher latency, a larger index, and more serving complexity.
Journey Context:
Bi-encoder dense embeddings compute one vector per document and one per query. That makes retrieval fast and cheap, but it collapses all token-level evidence into a single similarity score. ColBERT keeps per-token representations and performs a lightweight late interaction between query and document tokens: each query token is matched to its most similar document token (MaxSim), and those maxima are summed into the relevance score. This can substantially improve ranking for exact phrases, rare terms, and long documents, where a single pooled vector tends to blur the signal. The cost is a larger index (one vector per token instead of one per document), slower retrieval, and more complex deployment than a single-vector system. ColBERT is the right call for high-stakes search over long technical documents where precision matters more than latency; it is usually overkill for simple FAQ bots or small document sets. If full ColBERT is too heavy, a cross-encoder reranker applied to the top candidates from dense retrieval captures much of the benefit at lower serving cost, though it can only reorder what the first stage already retrieved.
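The difference between the two scoring schemes can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the token vectors are hand-written toy values rather than model outputs, mean pooling stands in for whatever pooling a real bi-encoder uses, and the function names (`single_vector_score`, `maxsim_score`, `stored_floats`) are hypothetical, not part of any ColBERT library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pool(token_vecs):
    """Collapse per-token vectors into one vector (assumed stand-in for bi-encoder pooling)."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]


def single_vector_score(query_tokens, doc_tokens):
    """Bi-encoder style: one pooled vector per side, one similarity score."""
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token take its best-matching doc token, then sum."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


def stored_floats(doc_tokens, per_token):
    """Floats the index must hold for one document under each scheme."""
    dim = len(doc_tokens[0])
    return len(doc_tokens) * dim if per_token else dim


# Toy data (hypothetical): a two-token query.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# doc_a contains both query tokens exactly, buried among unrelated tokens.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]] + [[0.0, 0.0, 1.0]] * 4

# doc_b contains only tokens that are loosely related to the whole query.
doc_b = [[0.6, 0.6, 0.5], [0.6, 0.6, 0.5]]

if __name__ == "__main__":
    for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
        print(
            name,
            "single-vector:", round(single_vector_score(query, doc), 3),
            "maxsim:", round(maxsim_score(query, doc), 3),
            "floats single/per-token:",
            stored_floats(doc, per_token=False), "/", stored_floats(doc, per_token=True),
        )
```

On this toy data the pooled score ranks `doc_b` above `doc_a` because the unrelated tokens in `doc_a` dilute its pooled vector, while MaxSim ranks `doc_a` first because each query token finds an exact match. The `stored_floats` helper shows the other side of the trade-off: the per-token index grows with document length, whereas the single-vector index does not.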
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:53:28.658285+00:00 — report_created — created