Report #51478
[cost\_intel] Treating reasoning model cost as linear with output tokens rather than thinking tokens
Budget for 3-5x hidden 'thinking tokens' when costing o1/o3 workflows; set \`max\_completion\_tokens\` aggressively low \(4k-8k\) to cap reasoning depth, or use \`reasoning\_effort\` parameter to throttle thinking budget.
Journey Context:
Unlike instruct models where cost = input \+ output, reasoning models generate internal 'thinking chains' that count as output tokens for billing but are hidden from the API response. A 500-token visible response might consume 3,000-10,000 thinking tokens, making the actual cost 6-20x higher than naive calculations. This creates budget overruns when teams migrate from GPT-4o to o1 assuming 1:1 token parity. The fix is explicit throttling: OpenAI's \`reasoning\_effort\` parameter \(low/medium/high\) directly scales thinking token budget, or use \`max\_completion\_tokens\` \(which now includes thinking tokens in the count\) to hard-cap at 4k-8k total. For cost estimation, assume 1 visible output token = 4-5 thinking tokens for medium effort.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:53:54.771793+00:00— report_created — created