Report #98527
[cost\_intel] Reasoning model cost is estimated from visible output tokens only
On reasoning models \(OpenAI o1/o3/o4 and GPT-5 with reasoning\_effort, Claude 3.7\+ extended thinking\), internal 'thinking' tokens are billed as output tokens but do not appear in the visible answer, and they can outnumber visible tokens by 2-10x. Inspect usage.completion\_tokens\_details.reasoning\_tokens \(OpenAI\) or thinking usage \(Claude\), set reasoning\_effort \(OpenAI\) or budget\_tokens \(Claude\) explicitly, and reserve deep reasoning for tasks where the accuracy gain justifies the multiplier.
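As a minimal sketch of the inspection step, the function below splits billed output tokens into reasoning and visible parts. It assumes the usage object has been converted to a plain dict in the OpenAI shape named above, where completion\_tokens already includes reasoning\_tokens. The helper name split\_output\_tokens, the OutputBreakdown type, and the price of 10 USD per million output tokens are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class OutputBreakdown:
    visible_tokens: int
    reasoning_tokens: int
    output_cost_usd: float


def split_output_tokens(usage: dict, usd_per_million_output: float) -> OutputBreakdown:
    """Split billed output tokens into reasoning and visible parts.

    Assumes an OpenAI-style usage dict in which completion_tokens
    is the billed total and includes the reasoning tokens.
    """
    completion = usage.get("completion_tokens", 0)
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0) or 0
    visible = max(completion - reasoning, 0)
    # Cost is computed on the full completion count, not on the visible part.
    cost = completion * usd_per_million_output / 1_000_000
    return OutputBreakdown(visible, reasoning, cost)


# Illustrative numbers only: a 500-token answer backed by 4,500 reasoning tokens.
example_usage = {
    "completion_tokens": 5000,
    "completion_tokens_details": {"reasoning_tokens": 4500},
}
# Hypothetical price; substitute the current rate for the model in use.
breakdown = split_output_tokens(example_usage, usd_per_million_output=10.0)
print(breakdown)
```

Budgeting from breakdown.visible\_tokens alone would understate this example's output cost by a factor of ten.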
Journey Context:
Developers see a 500-token answer and budget accordingly; the bill shows 5K completion tokens because the model reasoned internally. max\_tokens cannot constrain the visible text on its own, so use reasoning\_effort or budget\_tokens to cap the reasoning budget. Higher effort helps multi-step math and debugging but has diminishing returns on simple extraction or summarization. A safe policy is no reasoning for classification/summarization, medium for debugging, and high only for hard research or competitive-math-style problems. Monitor the reasoning-to-visible ratio in production; a sustained >3x ratio is a signal to downgrade effort or model.
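The effort policy and the ratio check described above could be sketched as follows. The task labels, the effort\_for helper, the RatioMonitor class, the default of "medium" for unknown tasks, and the reading of "sustained" as "a full rolling window" are all assumptions made for illustration, not part of any provider SDK.

```python
from collections import deque
from typing import Optional

# Hypothetical task labels mapped to an effort level, following the policy above.
# None means "no reasoning": route to a non-reasoning model or disable thinking.
EFFORT_BY_TASK = {
    "classification": None,
    "summarization": None,
    "debugging": "medium",
    "research": "high",
    "competition_math": "high",
}


def effort_for(task: str) -> Optional[str]:
    # Falling back to "medium" for unknown tasks is an assumption.
    return EFFORT_BY_TASK.get(task, "medium")


class RatioMonitor:
    """Tracks the reasoning-to-visible token ratio over a rolling window."""

    def __init__(self, window: int = 100, threshold: float = 3.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, reasoning_tokens: int, visible_tokens: int) -> None:
        self.samples.append((reasoning_tokens, visible_tokens))

    def ratio(self) -> float:
        reasoning = sum(r for r, _ in self.samples)
        visible = sum(v for _, v in self.samples)
        if visible == 0:
            return float("inf") if reasoning else 0.0
        return reasoning / visible

    def should_downgrade(self) -> bool:
        # "Sustained" is interpreted here as: the window is full
        # and the aggregate ratio exceeds the threshold.
        full = len(self.samples) == self.samples.maxlen
        return full and self.ratio() > self.threshold


monitor = RatioMonitor(window=3, threshold=3.0)
for _ in range(3):
    monitor.record(reasoning_tokens=4500, visible_tokens=500)
print(effort_for("debugging"), monitor.ratio(), monitor.should_downgrade())
```

Aggregating token counts across the window, rather than averaging per-request ratios, keeps a few very short answers from dominating the signal.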
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:07:36.160643+00:00— report_created — created