Report #79074
[cost\_intel] Using a single large frontier model call to both retrieve and synthesize RAG answers
Use a cheap model \(Haiku/Flash\) for query generation/extraction and a frontier model \(Sonnet/Pro\) only for the final synthesis.
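A minimal sketch of that split, assuming the models and the search backend are passed in as plain callables. The names `answer`, `cheap_model`, `frontier_model`, and `search` are hypothetical and do not come from any particular SDK; in practice each callable would wrap whichever client is in use.

```python
from typing import Callable, List

# Hypothetical type aliases: a model maps a prompt to text,
# a search backend maps a query to a list of passages.
ModelFn = Callable[[str], str]
SearchFn = Callable[[str], List[str]]


def answer(
    user_input: str,
    cheap_model: ModelFn,
    frontier_model: ModelFn,
    search: SearchFn,
) -> str:
    """Route query generation to the cheap model and synthesis to the frontier model."""
    # Step 1: cheap model turns the user input into search queries, one per line.
    raw = cheap_model(
        "Rewrite the following as search queries, one per line:\n" + user_input
    )
    queries = [line.strip() for line in raw.splitlines() if line.strip()]

    # Step 2: run every query against the search backend and collect passages.
    passages: List[str] = []
    for query in queries:
        passages.extend(search(query))

    # Step 3: only the final synthesis goes to the frontier model.
    context = "\n\n".join(passages)
    return frontier_model(f"Context:\n{context}\n\nQuestion: {user_input}")
```

Keeping the models as injected callables makes it easy to swap the synthesis step to the cheap model for simple fact-extraction requests.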
Journey Context:
In RAG, query generation \(turning user input into search queries\) is a simple extraction task, so a $3/MTok model is overkill for it. Split the pipeline instead: query generation with Haiku \($0.25/MTok\) -> search -> synthesis with Sonnet \($3/MTok\). This saves roughly 40% of input-token cost per interaction; the exact figure depends on how the input tokens divide between the two steps. If the user only wants a fact extracted from a document, Haiku can handle the synthesis too, saving about 90%.
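The savings arithmetic can be checked with a short sketch. The two prices are the ones quoted in this report; the token counts in the usage note are illustrative assumptions, not measurements, and `savings_fraction` is a hypothetical helper name.

```python
# Input prices in USD per million tokens, as quoted in this report.
CHEAP_PRICE = 0.25     # Haiku-class model
FRONTIER_PRICE = 3.00  # Sonnet-class model


def input_cost(tokens: int, price_per_mtok: float) -> float:
    """Cost in USD of sending `tokens` input tokens at the given price."""
    return tokens * price_per_mtok / 1_000_000


def savings_fraction(
    query_gen_tokens: int,
    synthesis_tokens: int,
    synthesis_price: float = FRONTIER_PRICE,
) -> float:
    """Fraction of input cost saved versus running both steps on the frontier model."""
    baseline = input_cost(query_gen_tokens + synthesis_tokens, FRONTIER_PRICE)
    split = input_cost(query_gen_tokens, CHEAP_PRICE) + input_cost(
        synthesis_tokens, synthesis_price
    )
    return (baseline - split) / baseline
```

With an assumed 2,000 input tokens for query generation and 3,000 for synthesis, `savings_fraction(2000, 3000)` comes to about 0.37, and `savings_fraction(2000, 3000, CHEAP_PRICE)` (cheap model for both steps) comes to about 0.92. The first number moves with the token split, so it is worth measuring on real traffic.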
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:19:14.617519+00:00— report_created — created