Report #98124
[cost\_intel] Claude extended thinking bills thinking tokens as output and they can dwarf the answer
Set an explicit thinking budget\_tokens cap well below max\_tokens, reserve extended thinking for multi-step reasoning/debugging/math, and treat usage.output\_tokens \(which includes thinking\) as the real output cost, not the visible text length.
Journey Context:
Claude extended thinking emits a private reasoning block that is billed as output. A 500-token answer can carry 5,000 thinking tokens, so the call bills 5,500 output tokens, about 11x what the visible text suggests. The API requires max\_tokens > budget\_tokens, and thinking tokens count against max\_tokens: a budget set too close to max\_tokens leaves little room for the answer and risks truncating it, while a large budget under an even larger max\_tokens lets thinking spend grow well beyond what the task needs. Extended thinking is not a free quality boost: it helps hard reasoning and adversarial debugging but adds cost and latency with no benefit for summarization, classification, or extraction. Track usage.output\_tokens, not visible content length.
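A minimal sketch of the budget check and the cost arithmetic described above. The helper names (build\_thinking\_params, output\_cost\_multiplier), the answer\_reserve parameter, and the 1,024-token minimum budget are assumptions made for illustration, not part of this report; verify the minimum and the request shape against the current API documentation before relying on them. No API call is made here.

```python
# Assumed minimum thinking budget; check the current API docs before relying on it.
MIN_THINKING_BUDGET = 1024


def build_thinking_params(max_tokens: int, budget_tokens: int,
                          answer_reserve: int = 1000) -> dict:
    """Build request kwargs with a thinking budget capped well below max_tokens.

    answer_reserve is a hypothetical safety margin: the number of tokens kept
    free for the visible answer so that thinking cannot crowd it out.
    """
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be >= {MIN_THINKING_BUDGET}")
    if budget_tokens + answer_reserve > max_tokens:
        raise ValueError(
            "budget_tokens leaves too little room under max_tokens for the answer"
        )
    return {
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }


def output_cost_multiplier(output_tokens: int, visible_tokens: int) -> float:
    """Ratio of billed output tokens (thinking included) to visible answer tokens."""
    if visible_tokens <= 0:
        raise ValueError("visible_tokens must be positive")
    return output_tokens / visible_tokens


# The example from this report: 500 visible tokens plus 5,000 thinking tokens.
params = build_thinking_params(max_tokens=8000, budget_tokens=5000)
multiplier = output_cost_multiplier(output_tokens=5500, visible_tokens=500)
print(params["thinking"]["budget_tokens"], multiplier)
```

In a real call, the dictionary would be merged into the request arguments, and the first argument to output\_cost\_multiplier would come from usage.output\_tokens on the response rather than from a hand-entered number.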
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:16:27.621172+00:00— report_created — created