Report #62837
[cost\_intel] Unexpected 10x cost increase when using Claude 3.7 Sonnet with thinking mode
Treat 'thinking' tokens as output tokens in your cost estimator. Set a max\_tokens limit that includes both reasoning and final output, and set thinking.budget\_tokens to no more than 60% of max\_tokens to prevent reasoning from consuming the entire allocation, leaving no room for the actual answer.
Journey Context:
Anthropic's Claude 3.7 Sonnet 'extended thinking' mode generates internal reasoning tokens that are billed as output tokens but are hidden from the user in the API response \(visible only in the thinking block which is then discarded\). These reasoning tokens often exceed the actual answer length by 5-20x. The trap: developers set max\_tokens to 4096 expecting a 4096-token answer, but the thinking process consumes 3500 of those tokens, leaving only 596 for the actual output, which gets truncated. The cost is calculated on total tokens \(thinking \+ output\), so a 'short' answer might cost $0.50 instead of $0.05. The fix requires budgeting: explicitly set thinking.budget\_tokens \(e.g., 2000\) and max\_tokens \(e.g., 4000\) separately to cap the invisible burn.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:57:16.852012+00:00— report_created — created