Report #51377
[frontier] Simple embedding search retrieves semantically similar but contextually irrelevant documents for complex queries
Implement HyDE (Hypothetical Document Embeddings): for a user query, first generate a hypothetical ideal answer using the LLM (without retrieval), then embed that hypothetical answer and use the embedding to retrieve real documents from the vector store. This bridges the gap between the query and document distributions, and is particularly effective for complex questions where the query keywords don't match the document terminology.
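The flow above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical stub that returns a canned answer in place of a real LLM call, and the in-memory `DOCS` list stands in for the vector store. All names here are assumptions introduced for the example.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def toy_embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Counter],
    docs: List[str],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Retrieve docs by embedding a generated hypothetical answer, not the query."""
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\nPassage:"
    )
    hypothetical = generate(prompt)   # step 1: LLM call, no retrieval
    probe = embed(hypothetical)       # step 2: embed the hypothetical answer
    scored = [(cosine(probe, embed(d)), d) for d in docs]  # step 3: search real docs
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def fake_llm(prompt: str) -> str:
    """Hypothetical stub; a real system would call an LLM here."""
    return (
        "Connection pool exhaustion happens when all database connections "
        "are checked out; raise the pool size or set a checkout timeout."
    )


DOCS = [
    "Tune the connection pool size and checkout timeout to avoid pool exhaustion under load.",
    "Our app supports dark mode and custom themes.",
    "Why the office coffee machine hangs on Mondays.",
]

top = hyde_retrieve("why does my app hang under load?", fake_llm, toy_embed, DOCS, k=1)
```

The short query shares few terms with the technical document, while the generated passage uses the document's own vocabulary ("connection pool", "checkout timeout"), which is the shift into document space that HyDE relies on.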
Journey Context:
Standard dense retrieval fails on out-of-distribution queries or when user queries are short/abstract but documents are technical/detailed. Query expansion helps, but HyDE generates a full synthetic document that captures the 'intent' and relevant keywords, shifting the query into the document embedding space. Tradeoff: requires an extra LLM call per retrieval (latency/cost), and can retrieve noise if the hypothetical answer is hallucinated. Best used with a reranker on top.
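One way to picture the reranking step: score each retrieved candidate against the original user query (not the hypothetical answer), so that documents pulled in only by hallucinated content drop out. The sketch below is an assumption-laden illustration; `overlap_score` is a toy token-overlap scorer standing in for a real reranking model such as a cross-encoder, and `rerank` and its parameters are hypothetical names.

```python
import re
from typing import Callable, List, Tuple


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the doc.

    Stand-in for a real reranking model.
    """
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0


def rerank(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float] = overlap_score,
    top_n: int = 2,
    min_score: float = 0.0,
) -> List[Tuple[float, str]]:
    """Re-score candidates against the ORIGINAL query and drop weak matches."""
    scored = [(score(query, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] > min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_n]


# Candidates as they might come back from the HyDE retrieval stage.
candidates = [
    "Kafka consumer lag grows when partitions are rebalanced too often.",
    "Reduce Kafka consumer lag by tuning max.poll.records and session timeouts.",
    "RabbitMQ queue mirroring configuration guide.",
]

result = rerank("how to reduce kafka consumer lag", candidates, top_n=2)
```

Here the RabbitMQ document, which shares no terms with the query, is filtered out, and the remaining two are ordered by how well they match what the user actually asked.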
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:43:17.262145+00:00— report_created — created