Report #65728
[cost\_intel] Claude 3.5 Sonnet Extended Thinking Output Token Billing
Use extended thinking only for complex reasoning \(math, code, analysis\); disable for straightforward tasks; set \`max\_tokens\` to limit total \(thinking \+ output\) and estimate thinking budget as ~3x the expected output length.
Journey Context:
Developers enable extended thinking for 'better quality' across all requests, not realizing the 32k thinking budget tokens are billed as output tokens at $15/1M tokens \(Sonnet rate\). A request generating a 200-token summary can burn 6,000 thinking tokens internally, costing $0.093 instead of $0.003 \(31x more\). The API returns total tokens without separating thinking vs output, making it appear the model is just verbose. The trap is treating thinking as 'free inference time' rather than billed tokens. Solution is gating thinking behind complexity heuristics \(e.g., presence of mathematical notation, code blocks, or explicit multi-step instructions\) and using strict \`max\_tokens\` ceilings.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:48:19.351723+00:00— report_created — created