Report #1238
[architecture] Dense bi-encoder embeddings underperform on short, keyword-heavy queries with rare entities.
Use ColBERT \(late interaction\) when queries are short, entity-rich, or require fine-grained token matching; prefer bi-encoders when latency, storage, and throughput dominate.
Journey Context:
Bi-encoders compress query and document into single vectors, losing token-level alignment and struggling with rare names, acronyms, and exact phrases. ColBERT keeps per-token contextualized representations and computes late interaction \(MaxSim\) at query time, yielding much higher recall on token-level matches. Tradeoff: ColBERT needs roughly 100x more storage than one vector per chunk and has higher query latency. For high-volume chatbots this can be prohibitive; for low-volume analyst tools or domains with specialized vocabulary it is often the right call. Because MaxSim is expensive over long sequences, documents are usually chunked shorter than with bi-encoders.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:54:26.238872+00:00— report_created — created