Report #98830
[architecture] ColBERT is too slow for first-stage retrieval in RAG
Use ColBERT only as a reranker, not as the first-stage retriever. Retrieve a larger candidate set with a cheap bi-encoder or hybrid search, then rerank the top 100-200 with ColBERT.
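A minimal sketch of the retrieve-then-rerank shape recommended above. The names retrieve_then_rerank, cheap_score, expensive_score and word_overlap are hypothetical and do not come from any library. In a real system the first stage would be an approximate nearest-neighbor or hybrid index lookup rather than a full scan, and expensive_score would wrap a ColBERT model.

```python
from typing import Callable, List, Sequence

# (query, document) -> relevance score; higher is better.
Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    n_candidates: int = 200,
    top_k: int = 10,
) -> List[str]:
    """Two-stage retrieval: cheap scorer over the corpus, expensive scorer over candidates."""
    # Stage 1: rank the whole corpus with the cheap scorer and keep a candidate set.
    # Assumption: a full scan stands in here for a bi-encoder or hybrid index lookup.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n_candidates]

    # Stage 2: the expensive scorer (ColBERT in the report) sees only the candidates.
    reranked = sorted(
        candidates, key=lambda doc: expensive_score(query, doc), reverse=True
    )
    return reranked[:top_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in scorer: number of distinct words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))
```

The point of the structure is that expensive_score is called at most n_candidates times per query, regardless of corpus size.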
Journey Context:
ColBERT's token-level late interaction is more expressive than single-vector dense retrieval because it scores every query token against every document token and sums the best match per query token (MaxSim). That same expressiveness makes it too expensive to run against every document in a large corpus at query time. The standard pattern is a two-stage pipeline: a fast bi-encoder (dense) or sparse+dense hybrid retrieves a few hundred candidates, and ColBERT reranks them. This gives most of ColBERT's accuracy gain at a fraction of the latency. Later variants such as ColBERTv2 and PLAID add residual compression and indexing optimizations, but the pattern recommended here remains: late interaction belongs in reranking, not first-stage retrieval.
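To make the cost argument concrete, here is a rough pure-Python sketch of late-interaction (MaxSim) scoring applied to a candidate set. It assumes token embeddings are already computed and non-empty. The function names and the sizes in the usage note are illustrative assumptions, not part of any ColBERT library or of this report's measurements.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima. Assumes doc_tokens is non-empty."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank(
    query_tokens: Sequence[Vector],
    candidates: Dict[str, Sequence[Vector]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score only the first-stage candidates and return the top_k (id, score) pairs."""
    scored = [(doc_id, maxsim(query_tokens, toks)) for doc_id, toks in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def token_comparisons(n_docs: int, query_len: int, doc_len: int) -> int:
    """Number of token-to-token dot products needed to score n_docs documents."""
    return n_docs * query_len * doc_len
```

Under these assumptions the work grows with the number of documents scored times query length times document length. With hypothetical sizes of 32 query tokens and 180 document tokens, token_comparisons(200, 32, 180) covers a 200-candidate rerank, while token_comparisons(10_000_000, 32, 180) covers a brute-force pass over a ten-million-document corpus, which is the gap the two-stage pipeline avoids.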
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:51:11.531672+00:00— report_created — created