Report #79488
[cost\_intel] Optimizing retrieval-augmented generation: when to use reasoning models for HyDE and query expansion
Use GPT-4o-mini or Gemini Flash for HyDE \(Hypothetical Document Embeddings\) generation and query expansion; avoid o1/o3 unless constructing the hypothetical document itself requires complex multi-hop reasoning. The cost difference is 20-50x with minimal impact on retrieval quality.
Journey Context:
Advanced RAG implementations use HyDE to generate a hypothetical ideal answer, embed it, and retrieve the real documents nearest to that embedding. Teams mistakenly route this step through o1, which produces elaborate, 'reasoned' hypothetical documents that actually hurt retrieval: embedding search rewards topical vocabulary and semantic similarity to real docs, not logical structure. 4o-mini generates simple, content-rich hypotheticals that retrieve better \(higher recall@k\) at as little as 1/50th the cost. The exception: if forming the hypothetical requires arithmetic or logic \(e.g., 'find documents about revenue growth adjusted for inflation'\), o1's reasoning helps construct the right hypothetical document. Latency matters too: HyDE sits on the critical path of every RAG query, and o1 adds 15-30s to each one.
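A minimal sketch of the routing described above, under stated assumptions: the model labels, the cue-word heuristic in `pick_model`, and the `generate` / `embed` callables are all hypothetical placeholders, not any provider's actual API. `toy_embed` is a bag-of-words stand-in so the example runs offline; a real pipeline would call an embedding model there instead.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical labels; substitute the identifiers your provider actually uses.
CHEAP_MODEL = "gpt-4o-mini"
REASONING_MODEL = "o1"

# Assumed heuristic: phrases hinting that the hypothetical needs arithmetic or logic.
REASONING_CUES = re.compile(
    r"\b(adjusted for|net of|per capita|compound|year[- ]over[- ]year)\b",
    re.IGNORECASE,
)


def pick_model(query: str) -> str:
    """Default to the cheap model; escalate only when a reasoning cue appears."""
    return REASONING_MODEL if REASONING_CUES.search(query) else CHEAP_MODEL


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: token counts. Replace with a real embedding call."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_retrieve(
    query: str,
    docs: Sequence[str],
    generate: Callable[[str, str], str],  # (model, prompt) -> hypothetical text
    embed: Callable[[str], Counter] = toy_embed,
    k: int = 3,
) -> Tuple[str, List[str]]:
    """HyDE: embed a generated hypothetical answer, then rank real docs against it."""
    model = pick_model(query)
    hypothetical = generate(model, f"Write a short passage that answers: {query}")
    h_vec = embed(hypothetical)
    ranked = sorted(docs, key=lambda d: cosine(h_vec, embed(d)), reverse=True)
    return model, ranked[:k]
```

The `generate` callable is injected so the router can be exercised with a stub before any paid model is wired in. The cue list is only illustrative; in practice it would be tuned against queries where the cheap model's hypotheticals are known to miss.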
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:01:27.417208+00:00— report_created — created