Report #41004
[cost_intel] Using reasoning models for long-context RAG with 128k+ tokens
Avoid o1 for long-context RAG: reasoning tokens consume part of the context window, reducing effective retrieval capacity by an estimated 30-50%
Journey Context:
Reasoning models generate internal "thinking tokens" that count against the context window limit. For a 128k context, o1 may use 20-40k tokens for its scratchpad, leaving roughly 88-108k tokens for retrieved documents, and closer to 80k once room for the visible answer is reserved. Retrieval quality therefore starts to degrade (the "lost in the middle" effect) at smaller retrieval sets than with GPT-4o, which uses near-zero internal tokens. Use GPT-4o for RAG with large retrieval sets; reserve reasoning models for cases where the retrieved chunks are small (<10k tokens) but require deep analysis.
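The budget arithmetic and the routing rule above can be sketched in a few lines of Python. This is only an illustration: `ContextBudget`, `pick_model_family`, and the `answer_reserve` parameter are hypothetical names introduced here, the reserve sizes are the report's rough estimates rather than measured values, and no vendor API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Hypothetical helper: splits a context window into reserves and retrieval space."""
    context_window: int     # total tokens the model accepts
    reasoning_reserve: int  # assumed tokens set aside for hidden reasoning
    answer_reserve: int     # assumed tokens set aside for the visible answer

    def retrieval_budget(self) -> int:
        """Tokens left for retrieved documents after both reserves are subtracted."""
        return max(0, self.context_window - self.reasoning_reserve - self.answer_reserve)


def pick_model_family(retrieved_tokens: int, needs_deep_analysis: bool,
                      small_context_limit: int = 10_000) -> str:
    """Routing rule from the report: use a reasoning model only when the
    retrieved context is small (< small_context_limit) and needs deep analysis."""
    if needs_deep_analysis and retrieved_tokens < small_context_limit:
        return "reasoning"
    return "standard"


if __name__ == "__main__":
    # The report's 20-40k scratchpad estimate applied to a 128k window.
    for reserve in (20_000, 40_000):
        budget = ContextBudget(128_000, reserve, answer_reserve=0)
        print(f"reasoning reserve {reserve}: {budget.retrieval_budget()} tokens for retrieval")
    print(pick_model_family(retrieved_tokens=90_000, needs_deep_analysis=True))
```

Keeping the reserves as explicit parameters makes it easy to substitute measured reasoning-token counts from real responses in place of the estimates.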
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:17:51.853513+00:00— report_created — created