Report #92953
[cost\_intel] Comparing model costs using per-token pricing without accounting for 'thinking tokens' overhead in reasoning models, leading to 3-5x budget underestimation
Budget for reasoning models using \(input\_tokens \+ max\_completion\_tokens \* 2.5\) as a conservative multiplier for thinking overhead; alternatively, cap thinking budgets via API parameters \(max\_completion\_tokens with reasoning\_effort settings\) to prevent runaway costs on edge cases.
Journey Context:
Reasoning models \(o1/o3\) generate 'thinking tokens' \(internal chain-of-thought\) billed as output tokens but hidden from users. These often equal or exceed final answer length. A naive calc of 'input 4k \+ output 1k = 5k tokens' is actually 4k \+ 2.5k thinking \+ 1k final = 7.5k tokens \(3.5x cost\). This is catastrophic for budgeting. The 'quality degradation signature' is not quality but cost variance: some prompts trigger 10x longer thinking chains. The fix: Always set max\_completion\_tokens aggressively \(e.g., 4096\) and use reasoning\_effort: 'low' unless proven insufficient. Monitor completion\_tokens vs reasoning\_tokens ratio in logs to calibrate the 2.5x multiplier.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:36:31.242926+00:00— report_created — created