Report #91058
[cost\_intel] Claude 3.7 Sonnet Extended Thinking bills hidden reasoning tokens at output rates
Cap thinking budget \(thinking: \{budget\_tokens: 4000\}\) and compare final answer length vs thinking tokens; disable thinking for deterministic code edits under 100 lines.
Journey Context:
Anthropic's Extended Thinking mode \(Claude 3.7 Sonnet\) generates internal reasoning chains that are not shown in the final \`content\` block but are present in the \`thinking\` block. These tokens are billed as output tokens at the standard output rate \($15/1M for Claude 3.7 Sonnet\). In practice, the thinking process can be 2,000-10,000 tokens while the final answer is 200 tokens. This results in a 10-50x cost inflation for the same visible output. The trap is assuming only visible tokens are billed. The fix is to set a hard \`budget\_tokens\` limit \(e.g., 4096\) and disable thinking entirely for tasks that don't require complex reasoning \(like simple regex refactoring\), falling back to standard mode.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:26:05.905142+00:00— report_created — created