Report #75609
[cost_intel] Sending full documents to frontier models for factual Q&A when retrieval plus a small model suffices
For factual Q&A over documents, use embedding-based retrieval to select relevant chunks, then generate answers with Haiku or Flash. This yields comparable quality to frontier models at 10-50x lower cost. Reserve frontier models for Q&A requiring synthesis across many documents or complex reasoning about retrieved content.
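As a rough illustration of the retrieval step, the sketch below ranks chunks by cosine similarity to the question and builds a short prompt from the top matches. The `embed` function here is a toy bag-of-words stand-in, not a real embedding model, and the names `top_chunks` and `build_prompt` are hypothetical; in practice you would swap in your embedding provider and send the resulting prompt to the small model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase term counts (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Assemble a compact prompt containing only the retrieved chunks."""
    context = "\n\n".join(top_chunks(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    docs = [
        "Refund requests are accepted within 30 days of purchase.",
        "The contract begins on 1 March.",
        "Our office is closed on public holidays.",
    ]
    print(build_prompt("How many days do I have to request a refund after purchase?", docs, k=1))
```

The point of the design is that the small model sees only the selected chunks rather than the full document, which is where the input-token savings come from.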
Journey Context:
The default RAG pattern retrieves chunks and sends them to the most capable model available. But for factual extraction questions, such as a refund policy or a contract date, a small model given the right chunk typically answers as well as a frontier model. The frontier model adds value only when the answer requires combining information across chunks, resolving contradictions, or applying complex reasoning. Benchmark your RAG pipeline to find the split: if, say, 80% of queries are factual, send those to the small model and route only the remaining 20% (synthesis, comparison, or reasoning) to the frontier model. On the generation step, the input price drops from roughly $3 per million tokens on Sonnet to roughly $0.25 per million on Haiku, about a 12x reduction. Over millions of queries, the savings can reach six figures.
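The routing and cost reasoning above can be sketched as follows. The keyword cues in `choose_tier` are a crude, hypothetical heuristic (the report does not say how to classify queries), and the query volume and tokens-per-query figures are illustrative assumptions; only the two per-million-token prices come from the report.

```python
# Hypothetical cues suggesting a question needs synthesis or reasoning (assumption).
SYNTHESIS_CUES = ("compare", "versus", "difference between", "contradict", "summarize", "why", "across")

SMALL_PRICE = 0.25     # USD per million input tokens (Haiku figure from the report)
FRONTIER_PRICE = 3.00  # USD per million input tokens (Sonnet figure from the report)


def choose_tier(question: str, num_chunks: int) -> str:
    """Route to 'frontier' when a query looks like synthesis, else 'small'."""
    q = question.lower()
    if num_chunks > 3 or any(cue in q for cue in SYNTHESIS_CUES):
        return "frontier"
    return "small"


def generation_cost(queries: float, tokens_per_query: int, price_per_million: float) -> float:
    """Input-token cost in USD for a batch of queries."""
    return queries * tokens_per_query / 1_000_000 * price_per_million


def routed_cost(queries: int, tokens_per_query: int, factual_share: float) -> float:
    """Blended cost when the factual share goes to the small model."""
    small = generation_cost(queries * factual_share, tokens_per_query, SMALL_PRICE)
    frontier = generation_cost(queries * (1 - factual_share), tokens_per_query, FRONTIER_PRICE)
    return small + frontier


if __name__ == "__main__":
    # Assumed workload: 50 million queries at 2,000 input tokens each.
    all_frontier = generation_cost(50_000_000, 2_000, FRONTIER_PRICE)
    routed = routed_cost(50_000_000, 2_000, factual_share=0.8)
    print(f"all frontier: ${all_frontier:,.0f}")
    print(f"routed 80/20: ${routed:,.0f}")
    print(f"savings:      ${all_frontier - routed:,.0f}")
```

Under these assumed volumes, sending everything to the frontier model costs about $300,000 in input tokens, while the 80/20 split costs about $80,000. A production router would more likely use a small classifier or benchmark-derived rules than keyword matching.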
Lifecycle
2026-06-21T09:30:34.888575+00:00— report_created — created