Report #78624
[cost_intel] Uniform model usage for RAG regardless of retrieval confidence
Route high-retrieval-confidence queries (top-1 cosine similarity >0.8) to GPT-4o and low-confidence multi-hop queries to o1. Achieves 80% cost savings with minimal accuracy loss.
Journey Context:
RAG answer quality depends on retrieval accuracy. When the cosine similarity between the query and the top-1 chunk is >0.8, the answer is usually verbatim in the chunk, and GPT-4o extracts it with >95% accuracy. Using o1 here adds no value, costs 10x more, and adds 20 s of latency. The 20% accuracy gain from o1 materializes only when retrieval confidence is low (0.5-0.7) or when the answer requires synthesizing contradictory information across >3 chunks. Implement a Corrective RAG (CRAG) style pattern: use retrieval confidence to route between the fast GPT-4o path (high confidence) and the slow o1 path (low confidence, reasoning required).
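The routing rule above can be sketched as a small pure function. This is a minimal illustration, not a tested implementation: `RetrievedChunk`, `choose_model`, and the `needs_synthesis` flag are hypothetical names, and the model identifiers are plain strings to pass to whatever client is in use. The report does not say how to handle scores between 0.7 and 0.8 or an empty retrieval result, so the sketch assumes both fall back to the reasoning model.

```python
from dataclasses import dataclass
from typing import Sequence

HIGH_CONFIDENCE = 0.8   # top-1 cosine threshold stated in the report
MULTI_HOP_CHUNKS = 3    # report: synthesis across >3 chunks benefits from o1

FAST_MODEL = "gpt-4o"
REASONING_MODEL = "o1"


@dataclass(frozen=True)
class RetrievedChunk:
    """Hypothetical container for one retrieved chunk."""
    text: str
    score: float  # cosine similarity between the query and this chunk


def choose_model(chunks: Sequence[RetrievedChunk],
                 needs_synthesis: bool = False) -> str:
    """Pick a model name from retrieval confidence.

    needs_synthesis is a caller-supplied hint (assumption) that the answer
    must reconcile information spread across several chunks.
    """
    if not chunks:
        # Assumption: nothing retrieved is treated as lowest confidence.
        return REASONING_MODEL

    top_score = max(chunk.score for chunk in chunks)
    multi_hop = needs_synthesis and len(chunks) > MULTI_HOP_CHUNKS

    if top_score > HIGH_CONFIDENCE and not multi_hop:
        return FAST_MODEL

    # Covers the 0.5-0.7 band from the report and, by assumption,
    # the unspecified 0.7-0.8 band and any multi-hop synthesis case.
    return REASONING_MODEL


if __name__ == "__main__":
    print(choose_model([RetrievedChunk("refund policy text", 0.91)]))
    print(choose_model([RetrievedChunk("loosely related text", 0.62)]))
```

Keeping the decision in one function with named thresholds makes it straightforward to log the chosen route per query and to re-tune the 0.8 cutoff against measured accuracy before relying on the savings figure.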
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:34:02.932330+00:00— report_created — created