Report #1239
[architecture] Using only dense retrieval gives low-precision top-k results for complex RAG queries.
Use a two-stage retrieve-then-rerank pipeline: a fast first-stage retriever (BM25, dense, or hybrid) returns a larger candidate set (e.g., top-100), then a cross-encoder reranker scores query-document pairs and returns the final top-k.
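A minimal sketch of the two-stage flow, assuming a toy term-overlap retriever in place of BM25/dense retrieval and a pluggable `score_pairs` callable in place of a real cross-encoder. All function names here (`first_stage`, `retrieve_then_rerank`, `toy_pair_scorer`) are hypothetical and chosen for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: takes (query, document) pairs, returns one score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], List[float]]


def first_stage(query: str, corpus: Sequence[str], n_candidates: int) -> List[int]:
    """Cheap, recall-oriented stage: rank documents by query-term overlap.

    Stand-in for BM25, dense, or hybrid retrieval (an assumption of this sketch).
    """
    q_terms = set(query.lower().split())
    overlap = [len(q_terms & set(doc.lower().split())) for doc in corpus]
    order = sorted(range(len(corpus)), key=lambda i: overlap[i], reverse=True)
    return order[:n_candidates]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    score_pairs: PairScorer,
    n_candidates: int = 100,
    top_k: int = 5,
) -> List[Tuple[int, float]]:
    """Retrieve a wide candidate set, then rescore only those candidates."""
    candidates = first_stage(query, corpus, n_candidates)
    pairs = [(query, corpus[i]) for i in candidates]
    scores = score_pairs(pairs)  # the expensive step, limited to the candidates
    reranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return reranked[:top_k]


def toy_pair_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: share of document words that match the query."""
    scores = []
    for q, d in pairs:
        q_terms = set(q.lower().split())
        d_words = d.lower().split()
        scores.append(sum(w in q_terms for w in d_words) / max(len(d_words), 1))
    return scores


corpus = [
    "cross encoder reranking is discussed somewhere in this very long "
    "unrelated page about cooking and travel",
    "cross encoder reranking",
    "bm25 retrieval",
    "reranking with a cross encoder",
]
query = "cross encoder reranking"
result = retrieve_then_rerank(query, corpus, toy_pair_scorer, n_candidates=3, top_k=2)
print(result)
```

In this toy run the first stage cannot separate documents 0, 1, and 3 (all contain every query term), while the second stage pushes the noisy document 0 out of the final top-k. In practice `score_pairs` could wrap a real cross-encoder, for instance the `predict` method of a sentence-transformers `CrossEncoder` loaded with `cross-encoder/ms-marco-MiniLM-L-6-v2`; that model choice is an assumption, not something this report specifies.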
Journey Context:
First-stage retrievers optimize for speed and recall, so their top-k often contains false positives. Cross-encoders attend jointly over the full query-document pair, which makes them considerably more accurate but too slow to run over the whole corpus. The standard pattern is therefore retrieve-then-rerank. A common mistake is dropping the reranking stage to save latency, which usually hurts answer quality more than expected. Tradeoff: reranking adds milliseconds per candidate, so its cost grows with the size of the candidate set; if latency is tight, batch the candidates and use a small model (e.g., a MiniLM MS MARCO reranker). For very long documents, truncate or chunk before reranking, since cross-encoders have a limited input length.
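The batching and chunking advice can be sketched as follows. This is an illustration under stated assumptions: chunks are overlapping word windows, each document takes the score of its best chunk (max aggregation is one common choice, not one this report prescribes), and `score_pairs` is a hypothetical callable standing in for the reranker. Word counts are only a rough proxy for the model's token limit.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

# Hypothetical interface: takes (query, text) pairs, returns one score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], List[float]]


def chunk_words(text: str, max_words: int, stride: int) -> List[str]:
    """Split text into overlapping word windows of at most max_words words."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def rerank_long_documents(
    query: str,
    docs: Sequence[str],
    score_pairs: PairScorer,
    max_words: int = 200,
    stride: int = 100,
    batch_size: int = 32,
) -> List[Tuple[int, float]]:
    """Score chunks in batches and rank each document by its best chunk."""
    pairs: List[Tuple[str, str]] = []
    owners: List[int] = []  # owners[j] is the document index that chunk j came from
    for doc_idx, doc in enumerate(docs):
        for chunk in chunk_words(doc, max_words, stride):
            pairs.append((query, chunk))
            owners.append(doc_idx)

    scores: List[float] = []
    for batch in batched(pairs, batch_size):
        scores.extend(score_pairs(batch))

    best: Dict[int, float] = {}
    for doc_idx, score in zip(owners, scores):
        if doc_idx not in best or score > best[doc_idx]:
            best[doc_idx] = score  # max over chunks (an assumption of this sketch)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


def toy_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: share of chunk words that match the query."""
    out = []
    for q, d in pairs:
        q_terms = set(q.lower().split())
        d_words = d.lower().split()
        out.append(sum(w in q_terms for w in d_words) / max(len(d_words), 1))
    return out


docs = [
    "filler filler filler filler filler filler cross encoder",
    "filler filler filler cross",
]
ranked = rerank_long_documents(
    "cross encoder", docs, toy_scorer, max_words=4, stride=2, batch_size=2
)
print(ranked)
```

Scored as whole documents, the two toy texts would tie; scoring by best chunk lets the relevant passage at the end of the first document surface. Plain truncation is the cheaper alternative when the relevant content is expected near the start of each document.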
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:54:26.320577+00:00— report_created — created