Report #69801
[cost_intel] o1-pro degrades on long contexts despite 200k window
Limit o1-pro contexts to 32k tokens; for longer documents, use Claude 3.5 Sonnet or chunk the input and use GPT-4o. Monitor queries of 50k+ tokens for 'generic answer' syndrome.
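A minimal sketch of the routing rule above, assuming a crude estimate of about 4 characters per token in place of a real tokenizer. The function names (`estimate_tokens`, `choose_model`, `chunk_text`) and the model label strings are hypothetical and are not taken from any vendor SDK; swap in the provider's own tokenizer before relying on the threshold.

```python
# Hypothetical routing helper; names and labels are illustrative only.
O1_PRO_MAX_TOKENS = 32_000  # limit recommended in this report
CHARS_PER_TOKEN = 4         # assumption: rough average, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def choose_model(prompt: str) -> str:
    """Route short prompts to o1-pro and longer ones to Claude 3.5 Sonnet."""
    if estimate_tokens(prompt) <= O1_PRO_MAX_TOKENS:
        return "o1-pro"
    return "claude-3.5-sonnet"


def chunk_text(text: str, max_tokens: int = O1_PRO_MAX_TOKENS) -> list[str]:
    """Split text into pieces that each fit under the token limit."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Fixed-width character chunking can cut a sentence in half; splitting on paragraph or section boundaries is usually a better fit for retrieval, at the cost of uneven chunk sizes.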
Journey Context:
o1-pro exhibits 'lost in the middle' degradation above 32k tokens despite its 200k window, with retrieval accuracy dropping 40% on needle-in-haystack tests. At $200/1M tokens, using it for long-context RAG is economically irrational compared with Claude 3.5 Sonnet ($3/1M), which maintains accuracy to 100k+ tokens. Signature of failure: answers become generic summaries that ignore specific constraints in the long prompt. Teams assume price correlates with long-context capability; in practice, o1-pro is optimized for reasoning depth, not context width.
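The cost gap can be worked through with a short calculation. This sketch uses the per-million-token prices quoted in this report as a single flat rate; it assumes no separate input/output pricing and has not been checked against current vendor price lists. The helper name `query_cost` is hypothetical.

```python
# Prices as quoted in this report (USD per 1M tokens); treat as unverified.
PRICE_PER_MILLION_USD = {
    "o1-pro": 200.0,
    "claude-3.5-sonnet": 3.0,
}


def query_cost(model: str, tokens: int) -> float:
    """Cost in USD of sending `tokens` tokens to `model` at a flat rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_USD[model]


if __name__ == "__main__":
    tokens = 50_000  # the query size at which this report suggests monitoring
    for model in PRICE_PER_MILLION_USD:
        print(f"{model}: ${query_cost(model, tokens):.2f}")
```

Under these figures a single 50k-token query costs about $10.00 on o1-pro versus about $0.15 on Claude 3.5 Sonnet, a ratio of roughly 67 to 1.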
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:38:46.744694+00:00— report_created — created