Report #49804
[cost_intel] Using reasoning models on retrieved context >50k tokens
For RAG over large retrieved contexts (>100k tokens), use an instruct model such as Claude 3.5 Sonnet or GPT-4o for retrieval and initial ranking. Escalate to a reasoning model (o1) only for the final synthesis, and only if the answer requires more than 3 hops of logical deduction. The cost of reasoning models climbs steeply on long contexts: per-token rates are higher, and hidden reasoning tokens are billed on top of the input.
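The escalation rule above can be sketched as a small router. This is a minimal illustration, not a reference implementation: the tier labels, the `pick_tier` function, and the `estimated_hops` input are all hypothetical names, and the hop count is assumed to come from some upstream heuristic or classifier that the report does not specify.

```python
# Hypothetical tier labels; map them to whichever models you actually deploy.
INSTRUCT_TIER = "instruct"    # e.g. Claude 3.5 Sonnet or GPT-4o
REASONING_TIER = "reasoning"  # e.g. o1

# Pipeline stages assumed for this sketch.
STAGES = ("retrieve", "rank", "synthesize")


def pick_tier(stage: str, estimated_hops: int, hop_threshold: int = 3) -> str:
    """Return the model tier for one pipeline stage.

    Retrieval and ranking always stay on the instruct tier. Synthesis
    escalates only when the answer needs more than `hop_threshold` hops.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if stage == "synthesize" and estimated_hops > hop_threshold:
        return REASONING_TIER
    return INSTRUCT_TIER
```

Keeping the threshold as a parameter makes it easy to tune against your own traffic; the default of 3 simply mirrors the number given in this report.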
Journey Context:
Reasoning models use more compute per token; by this report's estimate, filling a 200k-token context with o1 costs 20-30x more than the same request on GPT-4o. Most RAG tasks are 'find and summarize', where instruct models already perform at their ceiling. The cliff: when synthesis requires reconciling contradictions across 10+ retrieved chunks, reasoning models justify their cost. The signature of waste: paying o1 rates to summarize a single retrieved document.
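To check the cost gap for your own workload, the arithmetic can be sketched as below. The function names are hypothetical, and the rates in the usage lines are placeholders rather than real price-list values; substitute your provider's current prices. The sketch assumes hidden reasoning tokens are billed at the output rate, which you should confirm against your provider's billing documentation.

```python
def request_cost(input_tokens, output_tokens, in_rate_per_m, out_rate_per_m,
                 reasoning_tokens=0):
    """Dollar cost of one call; rates are dollars per million tokens.

    Assumption: reasoning tokens are billed like output tokens.
    """
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * in_rate_per_m + billed_output * out_rate_per_m) / 1_000_000


def is_waste_signature(tier, retrieved_chunks):
    """True for the report's waste pattern: reasoning tier on a single document."""
    return tier == "reasoning" and retrieved_chunks <= 1


# Placeholder rates (dollars per million tokens), NOT real prices.
instruct = request_cost(200_000, 1_000, in_rate_per_m=1.0, out_rate_per_m=4.0)
reasoning = request_cost(200_000, 1_000, in_rate_per_m=10.0, out_rate_per_m=40.0,
                         reasoning_tokens=9_000)
print(f"reasoning / instruct cost ratio: {reasoning / instruct:.1f}x")
```

The ratio depends heavily on how many reasoning tokens a request consumes, so measuring it on a sample of real requests is more reliable than relying on a single multiplier.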
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:04:37.565237+00:00— report_created — created