Report #78046
[synthesis] Expensive frontier models time out or waste tokens evaluating search results in RAG pipelines
Decouple retrieval evaluation from answer synthesis. Use a fast, cheap model (e.g., Haiku, Mini) for query expansion, search execution, and snippet relevance scoring, and pass only the filtered context to a large, expensive model for final synthesis.
Journey Context:
Naive RAG sends the user query directly to a vector DB and dumps the top-K results into a massive prompt for a frontier model. This causes high latency, high cost, and context pollution, because the expensive model spends tokens reading snippets that turn out to be irrelevant. Perplexity's architecture (inferred from API latency spikes and their Answer Engine blog) suggests a dual-model pipeline. The first model acts as a highly parallelized retrieval agent, rewriting queries and scoring snippets. Only the snippets that pass this relevance filter reach the synthesis model. This shrinks the context the frontier model has to process and lowers TTFT (Time To First Token), a pattern that OpenAI's Assistants API file search behavior appears to mirror.
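The split described above can be sketched roughly as follows. This is a minimal illustration, not Perplexity's or OpenAI's actual implementation: `cheap`, `frontier`, and `search` are hypothetical callables you would wire to your own small model, large model, and search backend, and the prompts, the 0-10 scoring scale, and the `keep`/`min_score` defaults are assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical interfaces: each "model" is a plain callable (prompt -> text),
# so the sketch runs without any vendor SDK.
Model = Callable[[str], str]
SearchFn = Callable[[str], List[str]]


def expand_query(cheap: Model, query: str) -> List[str]:
    """Ask the cheap model for rewrites (one per line); always keep the original."""
    raw = cheap(f"Rewrite this search query three ways, one per line:\n{query}")
    rewrites = [line.strip() for line in raw.splitlines() if line.strip()]
    return [query] + [r for r in rewrites if r != query]


def score_snippet(cheap: Model, query: str, snippet: str) -> float:
    """Ask the cheap model for a 0-10 relevance score; unparsable replies score 0."""
    raw = cheap(f"Rate 0-10 how relevant this snippet is to '{query}':\n{snippet}")
    try:
        return float(raw.strip())
    except ValueError:
        return 0.0


def answer(query: str, cheap: Model, frontier: Model, search: SearchFn,
           keep: int = 3, min_score: float = 5.0) -> str:
    """Cheap model expands and filters; frontier model sees only what survives."""
    queries = expand_query(cheap, query)
    with ThreadPoolExecutor() as pool:
        # Run every expanded query in parallel, then de-duplicate snippets
        # while preserving first-seen order.
        batches = pool.map(search, queries)
        snippets = list(dict.fromkeys(s for batch in batches for s in batch))
        # Score snippets in parallel with the cheap model.
        scores = list(pool.map(lambda s: score_snippet(cheap, query, s), snippets))

    ranked = sorted(zip(scores, snippets), key=lambda pair: pair[0], reverse=True)
    kept = [snippet for score, snippet in ranked if score >= min_score][:keep]

    # Only the filtered context reaches the expensive model.
    context = "\n".join(f"- {s}" for s in kept)
    return frontier(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```

Passing the models in as callables keeps the retrieval stage testable with stubs and makes it easy to swap the cheap model without touching the synthesis step. In a real deployment you would likely add per-call timeouts and a fallback for the case where no snippet clears `min_score`.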
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:35:50.788845+00:00— report_created — created