Report #90217
[cost_intel] Haiku pre-filtering for RAG pipelines that waste frontier model calls
Use Claude 3 Haiku to pre-filter and rerank retrieved documents before GPT-4 or Claude Sonnet processes them, reducing frontier model cost by a reported 80% with under 2% recall loss. Implement hybrid scoring: Haiku rejects the obvious non-matches, and Sonnet scores only the borderline cases.
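A minimal sketch of the hybrid scoring idea, under stated assumptions. The report gives no thresholds or function names, so `hybrid_filter`, `reject_below`, `accept_above`, and `borderline_keep` are all hypothetical, and the two scorer callables stand in for whatever Haiku and Sonnet calls your pipeline makes (no real API is invoked here).

```python
from typing import Callable, List, Tuple

# (query, document) -> relevance score in [0, 1]; a stand-in for a model call.
Scorer = Callable[[str, str], float]


def hybrid_filter(
    query: str,
    docs: List[str],
    cheap_score: Scorer,           # e.g. a Haiku-backed scorer (assumption)
    strong_score: Scorer,          # e.g. a Sonnet-backed scorer (assumption)
    reject_below: float = 0.2,     # hypothetical threshold: obvious rejects
    accept_above: float = 0.8,     # hypothetical threshold: obvious accepts
    borderline_keep: float = 0.5,  # hypothetical cut for re-scored documents
    top_k: int = 3,
) -> Tuple[List[str], int]:
    """Return the top_k kept documents and the number of strong-model calls."""
    kept: List[Tuple[float, str]] = []
    strong_calls = 0
    for doc in docs:
        score = cheap_score(query, doc)
        if score < reject_below:
            continue  # obvious reject: the strong model never sees it
        if score < accept_above:
            # Borderline: escalate to the stronger model for a second opinion.
            score = strong_score(query, doc)
            strong_calls += 1
            if score < borderline_keep:
                continue
        kept.append((score, doc))
    # Highest-scoring documents first, then truncate to top_k.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in kept[:top_k]], strong_calls


if __name__ == "__main__":
    # Stub scorers with fixed scores, for illustration only.
    cheap = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.4}
    strong = {"c": 0.7, "d": 0.3}
    selected, calls = hybrid_filter(
        "q", list(cheap), lambda q, d: cheap[d], lambda q, d: strong[d]
    )
    print(selected, calls)
```

The share of documents that land between the two thresholds determines how many strong-model calls remain, so the thresholds would need tuning against a labeled sample before trusting any savings estimate.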
Journey Context:
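In RAG pipelines, feeding 10 retrieved documents to GPT-4 for final answer synthesis is reported to cost $0.30-0.50 per query (input priced at $30/1M tokens). Having Haiku score the same 10 documents for relevance is reported to cost $0.03 (at $0.25/1M input + $1.25/1M output tokens). The quoted 4k-token count does not reproduce these figures: 4k input tokens cost $0.12 at $30/1M and $0.001 at $0.25/1M, so the per-query dollar amounts imply a larger token volume and should be recomputed for your own workload. If Haiku filters down to the top 3 documents and only those go to Sonnet, the reported total drops to $0.09 (Haiku filter + Sonnet synthesis) versus $0.45 (Sonnet only), an 80% reduction. Quality impact: Haiku misses about 2% of edge-case relevant documents that Sonnet would catch, a tradeoff the cost reduction justifies for high-volume applications. Critical: prompt Haiku with exactly the same relevance criteria as Sonnet, so the filter and the synthesis model do not disagree on what counts as relevant.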
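The arithmetic above can be checked with a short sketch. The per-million-token rates are the ones quoted in the report. The function names `call_cost` and `reduction` are hypothetical, and the 4,000-token input is the report's own figure, used here only to show what it yields at those rates.

```python
def call_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    """Dollar cost of one model call, given per-million-token rates."""
    return (
        input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    ) / 1_000_000


def reduction(baseline: float, optimized: float) -> float:
    """Fractional saving of `optimized` relative to `baseline`."""
    return 1.0 - optimized / baseline


if __name__ == "__main__":
    # Rates quoted in the report (USD per 1M tokens).
    frontier_in = 30.0
    haiku_in, haiku_out = 0.25, 1.25

    # 4,000 input tokens, output tokens ignored for simplicity (assumption).
    print(f"frontier, 4k input: ${call_cost(4_000, 0, frontier_in, 0.0):.4f}")
    print(f"haiku, 4k input:    ${call_cost(4_000, 0, haiku_in, haiku_out):.4f}")

    # The report's per-query totals: $0.45 before, $0.09 after.
    print(f"reported saving:    {reduction(0.45, 0.09):.0%}")
```

Plugging in measured token counts per document, plus the number of output tokens each scoring call produces, gives a per-query estimate that can be compared against the reported $0.09 and $0.45.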
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:01:21.513306+00:00— report_created — created