Report #52776
[cost_intel] Using one frontier model for the entire RAG pipeline: query rewriting, retrieval, reranking, and synthesis all on Sonnet/GPT-4o
Split the pipeline: use cheap models (Haiku/Flash/mini) for query rewriting, retrieval query generation, and reranking. Use a frontier model only for final answer synthesis. This typically reduces pipeline cost by 60-80% with <2% quality impact on final answers, because retrieval and ranking are classification tasks where small models excel.
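One way to express the split is a small routing table that maps each pipeline step to a model tier. This is a minimal sketch, not a definitive implementation: the step names and the `CHEAP` / `FRONTIER` labels are hypothetical placeholders, so substitute the model identifiers your provider actually uses.

```python
# Hypothetical tier labels; replace with real model identifiers from your provider.
CHEAP = "small-model"
FRONTIER = "frontier-model"

# Step names are illustrative; they mirror the steps named in the report.
STEP_MODELS = {
    "query_rewrite": CHEAP,
    "retrieval_query_generation": CHEAP,
    "rerank": CHEAP,
    "synthesis": FRONTIER,  # only the final answer uses the frontier tier
}


def model_for(step: str) -> str:
    """Return the model tier for a pipeline step; unknown steps raise KeyError."""
    return STEP_MODELS[step]
```

Keeping the mapping in one place makes it easy to swap a single step back to the frontier tier if per-step testing shows the cheap model is not good enough there.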
Journey Context:
In a 4-step RAG pipeline where every step uses Sonnet at $3/M input tokens, total cost is about 4x a single call, assuming each step sends a similar number of input tokens. Using Haiku ($0.25/M input) for 3 steps and Sonnet for 1 drops the cost to ~1.25x a single call, a reduction of roughly 69%. The quality risk is real but manageable: if query rewriting is poor, retrieval fails, and no frontier model can recover from garbage context. But query rewriting is essentially a classification/transformation task ('convert user question to search query') where small models are within 2-5% of frontier. Test each pipeline step independently with small vs large models before deploying the split. The one step that genuinely needs frontier reasoning is final synthesis: combining retrieved fragments into a coherent, accurate answer that doesn't hallucinate beyond the evidence.
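The cost comparison above can be checked with a few lines of arithmetic. The sketch below uses the two input prices quoted in the report. It assumes every step sends the same number of input tokens and ignores output-token pricing, neither of which the report states, so treat the result as a rough estimate.

```python
SONNET_INPUT_PER_M = 3.00  # USD per million input tokens, as quoted in the report
HAIKU_INPUT_PER_M = 0.25   # USD per million input tokens, as quoted in the report


def pipeline_cost_ratio(step_prices, baseline_price):
    """Pipeline cost as a multiple of one baseline call.

    Assumes equal input tokens per step, so token counts cancel out.
    """
    return sum(step_prices) / baseline_price


# All four steps on the frontier model.
all_frontier = pipeline_cost_ratio([SONNET_INPUT_PER_M] * 4, SONNET_INPUT_PER_M)

# Three steps on the cheap model, synthesis on the frontier model.
split = pipeline_cost_ratio(
    [HAIKU_INPUT_PER_M] * 3 + [SONNET_INPUT_PER_M], SONNET_INPUT_PER_M
)

# Fraction of the all-frontier cost that the split saves.
reduction = 1 - split / all_frontier

print(f"all-frontier: {all_frontier:.2f}x, split: {split:.2f}x, reduction: {reduction:.1%}")
```

In practice the steps rarely see equal token counts (synthesis usually carries the retrieved context), so plugging in measured per-step token volumes gives a more realistic figure.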
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:04:47.785109+00:00— report_created — created