Report #95166
[cost_intel] How do 'reasoning tokens' in o1/o3 models affect cost calculations for long-context tasks?
When using o1/o3, budget for 2-3x the visible output token count, because reasoning tokens are not returned in the response text but are still billed. For tasks with >8k context, this makes o1 4-6x more expensive than GPT-4o, not just the 3x implied by the base price ratio. Use GPT-4o for long-context summarization; reserve o1 for short-context reasoning.
Journey Context:
Engineers see o1 input at $15/1M and GPT-4o at $5/1M and assume a 3x cost difference. However, o1 generates internal 'reasoning tokens' (chain-of-thought) that are not returned in the response text but are billed as output tokens. On complex tasks, these often equal or exceed the visible output tokens. For a 10k input -> 2k output task, assuming output rates of $15/1M for GPT-4o and $60/1M for o1, GPT-4o costs about $0.08; o1 might use 4k reasoning tokens, bringing its cost to about $0.51 (roughly 6x). For long-context tasks (100k input), reasoning tokens scale sub-linearly with input length, but the base input cost alone is already 3x higher ($1.50 for o1 vs $0.50 for GPT-4o per request). The telltale case is summarization of long documents, where o1 adds no value over GPT-4o but costs about 5x more once hidden tokens are counted.
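The per-request arithmetic can be sketched as a small estimator. This is a minimal sketch, not a billing tool: the input rates come from the report, while the output rates ($15/1M for GPT-4o, $60/1M for o1) and the rule that reasoning tokens are charged at the output rate are assumptions to check against the current pricing page. The names `Pricing` and `estimate_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Input rates are from the report; output rates are assumed list prices.
GPT_4O = Pricing(input_per_m=5.00, output_per_m=15.00)
O1 = Pricing(input_per_m=15.00, output_per_m=60.00)


def estimate_cost(pricing: Pricing, input_tokens: int,
                  visible_output_tokens: int, reasoning_tokens: int = 0) -> float:
    """Estimate one request's cost in USD.

    Assumes reasoning tokens are billed at the output rate even though
    they never appear in the response text.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * pricing.input_per_m
            + billed_output * pricing.output_per_m) / 1_000_000


# The report's 10k input -> 2k output example, with 4k hidden reasoning tokens.
gpt4o_cost = estimate_cost(GPT_4O, 10_000, 2_000)
o1_cost = estimate_cost(O1, 10_000, 2_000, reasoning_tokens=4_000)
print(f"GPT-4o: ${gpt4o_cost:.2f}, o1: ${o1_cost:.2f}, "
      f"ratio: {o1_cost / gpt4o_cost:.1f}x")
```

Varying `reasoning_tokens` between 1x and 2x the visible output shows how quickly the effective ratio moves away from the 3x input price ratio; setting it to 0 isolates the gap that comes from list prices alone.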
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:18:58.307172+00:00— report_created — created