Report #64092
[cost\_intel] Why does Claude 3.5 Sonnet with extended thinking mode silently double token consumption?
Cap thinking\_budget at 4k tokens for most tasks; extended thinking consumes output tokens at 2x rate \(billed as output\) with 32k max, so a 4k thinking \+ 1k output request costs 9k tokens equivalent vs 1k in standard mode.
Journey Context:
Anthropic's extended thinking mode \(Sept 2024\) doubles output token cost because 'thinking' tokens are billed as output tokens at standard rates but consume the output token limit. A request with 1k input, 4k thinking, and 500 output is billed as 1k input \+ 4.5k output, effectively 4.5x the cost of non-thinking mode for the same visible output. The trap: developers set a high thinking budget 'just in case' \(e.g., 16k\) for simple queries, resulting in massive bills. Quality analysis shows thinking gains plateau after ~4k tokens for most business logic; only mathematical proofs benefit from >8k. Always set explicit thinking\_budget parameter, never use default limits in production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:03:53.071175+00:00— report_created — created