Report #72417
[frontier] My RAG latency is too high because I retrieve many documents for every query, but most aren't needed for the final answer.
Implement Speculative RAG: use a smaller, faster 'draft' model to generate a preliminary answer and identify which documents are actually needed, then verify and refine with a larger model using only those documents. This reduces latency by 40-60%.
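A minimal sketch of the draft-then-verify flow described above, assuming both models are exposed as plain Python callables. The names `speculative_rag`, `toy_draft`, and `toy_verify` are hypothetical, and the toy functions are keyword-matching stand-ins for real model calls, not any library's API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: a draft step returns (draft_answer, needed_doc_ids);
# a verify step takes the query, the draft, and only the selected documents.
DraftFn = Callable[[str, Dict[str, str]], Tuple[str, List[str]]]
VerifyFn = Callable[[str, str, Dict[str, str]], str]


def speculative_rag(
    query: str, corpus: Dict[str, str], draft: DraftFn, verify: VerifyFn
) -> Tuple[str, List[str]]:
    # Step 1: the cheap draft model proposes an answer and names the documents it needs.
    draft_answer, needed_ids = draft(query, corpus)
    # Step 2: keep only ids that exist in the corpus (the draft may name unknown ones).
    selected = {doc_id: corpus[doc_id] for doc_id in needed_ids if doc_id in corpus}
    # Step 3: the larger model verifies and refines using only the selected documents.
    final_answer = verify(query, draft_answer, selected)
    return final_answer, list(selected)


def toy_draft(query: str, corpus: Dict[str, str]) -> Tuple[str, List[str]]:
    # Stand-in for a small model: pick documents that share a word with the query.
    terms = set(query.lower().split())
    hits = [doc_id for doc_id, text in corpus.items() if terms & set(text.lower().split())]
    return "draft: " + query, hits


def toy_verify(query: str, draft_answer: str, selected: Dict[str, str]) -> str:
    # Stand-in for a large model: report which documents it was given.
    return f"{draft_answer} | verified against {sorted(selected)}"


corpus = {
    "a": "parse_config reads the yaml file",
    "b": "unrelated logging helper",
    "c": "parse_config is called at startup",
}
answer, used_ids = speculative_rag("where is parse_config used", corpus, toy_draft, toy_verify)
print(used_ids)
```

The verifier only ever sees the documents the draft step selected, which is where the latency saving is expected to come from. Any real saving depends on the models and corpus involved.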
Journey Context:
Standard RAG retrieves first and then generates, so it spends latency on documents that turn out to be irrelevant. Speculative RAG (2024 paper, productionized 2025) reverses the flow: generate first with a cheap model to identify knowledge gaps, then retrieve specifically for those gaps. This is the 'speculative execution' pattern from CPU architecture applied to RAG. Key insight: the draft model's output reveals which context is actually necessary, either through attention analysis or through explicit citations in the draft. Tradeoff: it requires orchestrating two models (draft and verifier), which increases complexity. This matters most for latency-sensitive AI coding agents, where 5-second RAG delays break flow; the approach brings latency under 2 seconds while maintaining accuracy for complex 'find all usages' queries across large codebases.
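The "explicit citations" route mentioned above can be sketched as follows, assuming the draft model is prompted to tag its sources with markers of the form `[doc:ID]`. That marker format, and the helper names `cited_doc_ids` and `build_verifier_context`, are assumptions made for illustration; the report does not specify a citation format.

```python
import re
from typing import Dict, List

# Assumed citation marker: [doc:some/path.py]. This format is hypothetical.
CITATION = re.compile(r"\[doc:([A-Za-z0-9_./-]+)\]")


def cited_doc_ids(draft_text: str) -> List[str]:
    # Collect cited ids in first-seen order, dropping repeats.
    seen: List[str] = []
    for doc_id in CITATION.findall(draft_text):
        if doc_id not in seen:
            seen.append(doc_id)
    return seen


def build_verifier_context(draft_text: str, corpus: Dict[str, str]) -> str:
    # Keep only cited ids that actually exist, then join their text for the verifier prompt.
    ids = [doc_id for doc_id in cited_doc_ids(draft_text) if doc_id in corpus]
    return "\n\n".join(f"[doc:{doc_id}]\n{corpus[doc_id]}" for doc_id in ids)


draft_output = (
    "parse_config is called in [doc:src/main.py] and [doc:src/cli.py]; "
    "see also [doc:src/main.py]."
)
files = {
    "src/main.py": "cfg = parse_config(path)",
    "src/util.py": "def helper(): ...",
}
context = build_verifier_context(draft_output, files)
print(cited_doc_ids(draft_output))
print(context)
```

Filtering cited ids against the known corpus guards against a draft that cites documents that do not exist. It does not catch documents the draft should have cited but missed, so for 'find all usages' style queries the verifier may still need a fallback retrieval step.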
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:08:06.761125+00:00— report_created — created