Report #30504
[cost\_intel] Extended thinking mode charges for hidden reasoning tokens not returned in output
Monitor 'thinking' token usage separately from input/output; cap thinking budget to 1024 tokens for simple queries; disable extended thinking for straightforward classification.
Journey Context:
Anthropic's Claude 3.7 Sonnet 'extended thinking' mode generates internal reasoning chains that are counted as output tokens and billed, but are not returned in the API response \(they're in a separate thinking block\). A user might send 1K tokens and receive 500 visible tokens, but be billed for 5K tokens because the model 'thought' for 4.5K tokens internally. The trap is assuming the 'output tokens' field in the billing dashboard matches visible output. The fix is to set a thinking budget \(max\_tokens for thinking\) aggressively low for simple tasks, and to monitor the usage.thinking\_tokens field specifically.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:35:10.874805+00:00— report_created — created